- # Session Start: Mon Jul 16 00:00:00 2007
- # Session Ident: #html-wg
- # [00:00] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
- # [00:05] * Quits: deltab (deltab@82.36.30.34) (Client exited)
- # [00:05] * Joins: gavin (gavin@74.103.208.221)
- # [00:06] * Joins: deltab (deltab@82.36.30.34)
- # [00:15] <Philip`> Aha, a good use of educational resources...
- # [00:16] <Philip`> http://people.pwf.cam.ac.uk/pjt47/html/dmoz-unique-pages.txt.gz (~30MB) has dmoz.org's 4.5M URLs, with duplicates removed, in case somebody wants that list without downloading the ~300MB of RDF data
- # [00:18] * Quits: heycam (cam@203.214.127.179) (Ping timeout)
- # [00:54] * Quits: Lachy (chatzilla@203.214.140.60) (Quit: ChatZilla 0.9.78.1 [Firefox 2.0.0.4/2007051502])
- # [00:55] * Quits: bogi (bogi@153.19.120.250) (Ping timeout)
- # [01:07] * Quits: tH (Rob@87.102.36.227) (Quit: ChatZilla 0.9.78.1-rdmsoft [XULRunner 1.8.0.9/2006120508])
- # [01:14] * Joins: schepers (schepers@128.30.52.30)
- # [01:18] * Joins: heycam (cam@130.194.72.84)
- # [01:25] * Quits: heycam (cam@130.194.72.84) (Quit: bye)
- # [01:25] * Joins: heycam (cam@130.194.72.84)
- # [01:48] * Joins: Lachy (chatzilla@203.214.140.60)
- # [02:08] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
- # [02:13] * Joins: gavin (gavin@74.103.208.221)
- # [02:22] * Quits: Sander (svl@86.87.68.167) (Quit: And back he spurred like a madman, shrieking a curse to the sky.)
- # [02:46] * Quits: Lionheart (robin@66.57.69.65) (Ping timeout)
- # [04:07] * Philip` can download and collect statistics about web pages at a rate of about 5 per second on a single machine, which doesn't seem too bad
- # [04:15] <Philip`> http://www.sebascos.dk/ - by far the winner in the number-of-<head>s-on-one-page contest; plus it's got cats
- # [04:16] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
- # [04:21] * Joins: gavin (gavin@74.103.208.221)
- # [06:24] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
- # [06:29] * Joins: gavin (gavin@74.103.208.221)
- # [07:54] * Quits: mjs (mjs@64.81.48.145) (Quit: mjs)
- # [07:55] * Joins: mjs (mjs@64.81.48.145)
- # [08:12] * Quits: xover (xover@193.157.66.5) (Ping timeout)
- # [08:31] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
- # [08:32] * Joins: MikeSmith (MikeSmith@mcclure.w3.org)
- # [08:36] * Joins: gavin (gavin@74.103.208.221)
- # [08:37] * Quits: sbuluf (fgwg@200.49.140.174) (Ping timeout)
- # [08:51] * Quits: MikeSmith (MikeSmith@mcclure.w3.org) (Quit: Less talk, more pimp walk.)
- # [08:52] * Joins: xover (xover@193.157.66.5)
- # [09:02] * Quits: schepers (schepers@128.30.52.30) (Client exited)
- # [09:30] * Joins: Zeros (Zeros-Elip@67.154.87.254)
- # [09:33] * Quits: heycam (cam@130.194.72.84) (Quit: bye)
- # [10:01] * Joins: bogi (bogi@153.19.120.250)
- # [10:12] <hsivonen> Philip`: I wonder how those <head>s ended up there
- # [10:13] * Joins: MikeSmith (MikeSmith@mcclure.w3.org)
- # [10:38] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
- # [10:43] * Joins: gavin (gavin@74.103.208.221)
- # [10:47] * Quits: Zeros (Zeros-Elip@67.154.87.254) (Quit: Leaving)
- # [10:54] * Joins: heycam (cam@203.214.127.179)
- # [11:04] * Joins: ROBOd (robod@86.34.246.154)
- # [11:24] * Joins: Lionheart (robin@66.57.69.65)
- # [12:09] * Quits: beowulf (carisenda@91.84.50.132) (Ping timeout)
- # [12:30] * Joins: tH (Rob@87.102.36.227)
- # [12:47] * Quits: Lionheart (robin@66.57.69.65) (Ping timeout)
- # [13:01] * Joins: zcorpan_ (zcorpan@90.229.146.10)
- # [13:31] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
- # [13:37] * Joins: gavin (gavin@74.103.208.221)
- # [13:39] * Quits: zcorpan_ (zcorpan@90.229.146.10) (Ping timeout)
- # [14:01] * Joins: schepers (schepers@128.30.52.30)
- # [14:16] <Philip`> http://encarta.msn.com/encyclopedia_761579147/William_I_(of_England).html has lots of <div style="clear:left" />, resulting in unclosed divs - XML seems to cause as much confusion as it solves
- # [14:22] * Joins: zcorpan_ (zcorpan@90.229.146.10)
- # [14:22] <zcorpan_> hsivonen: http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2007-June/012070.html
- # [14:23] <zcorpan_> hsivonen: the html5lib tests are ahead of the spec :)
- # [14:26] <Philip`> If you changed that text, you'd have to change the "In the RCDATA and CDATA states, a further escape flag is used to control the behaviour of the tokeniser" too since it'll apply to PCDATA
- # [14:28] <Philip`> though I guess it isn't relevant to PCDATA, so it should be more like "When the content model flag is set to the PCDATA state, or when it is set to the RCDATA state and the escape flag is false, ...", perhaps
- # [14:33] <zcorpan_> the escape flag can't be true in the pcdata state
- # [14:36] <zcorpan_> so ((pcdata || rcdata) && !escape_flag) is the same as (pcdata || (rcdata && !escape_flag))
- # [14:40] <hsivonen> zcorpan_: ok.
- # [14:42] <hsivonen> jgraham: it would be useful for me and presumably for anyone else writing a streaming parser if test cases with non-streamable error recovery were in separate .dat files
- # [14:43] <hsivonen> jgraham: is it OK to move stuff around so that each .dat either contains non-streamable cases or streamable cases?
- # [15:11] * Joins: edas (edaspet@88.191.34.123)
- # [15:18] * Joins: gorme (gorm@213.236.208.22)
- # [15:38] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
- # [15:43] * Joins: gavin (gavin@74.103.208.221)
- # [15:45] * Quits: MikeSmith (MikeSmith@mcclure.w3.org) (Client exited)
- # [15:54] * Joins: MikeSmith (MikeSmith@mcclure.w3.org)
- # [16:21] * Joins: billmason (billmason@69.30.57.156)
- # [16:25] * Joins: tH_ (Rob@87.102.76.26)
- # [16:27] * Quits: tH (Rob@87.102.36.227) (Ping timeout)
- # [16:27] * tH_ is now known as tH
- # [16:37] * Quits: edas (edaspet@88.191.34.123) (Ping timeout)
- # [16:47] * Joins: edas (edaspet@88.191.34.123)
- # [17:12] * Joins: kazuhito (kazuhito@222.151.186.182)
- # [17:31] * Joins: Lionheart (robin@198.86.248.1)
- # [17:46] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
- # [17:48] * Quits: edas (edaspet@88.191.34.123) (Ping timeout)
- # [17:51] * Joins: gavin (gavin@74.103.208.221)
- # [17:57] * Quits: kazuhito (kazuhito@222.151.186.182) (Quit: Quitting!)
- # [18:05] * Quits: Lionheart (robin@198.86.248.1) (Ping timeout)
- # [18:09] * Joins: Sander (svl@86.87.68.167)
- # [18:42] * Quits: MikeSmith (MikeSmith@mcclure.w3.org) (Quit: Less talk, more pimp walk.)
- # [18:45] <Philip`> http://canvex.lazyilluminati.com/misc/stats/2/analyse.cgi/index
- # [18:45] <Philip`> now with 8192 pages
- # [18:46] <Philip`> and with not especially great scalability, so it's starting to go a bit slowly :-(
- # [18:46] <Philip`> (mainly since it stores all the details for each page, rather than just aggregate statistics)
- # [18:47] <zcorpan_> Philip`: you may want to be careful with the usage of the phrase "random sample"
- # [18:47] <hsivonen> Philip`: the frequency of "td" suggests to me that shunning layout tables is tilting against the windmills and doesn't serve the needs of authors
- # [18:48] <Philip`> How careful do I have to be about 'random sample' when there's a well-defined list, and I'm just shuffling the whole list then picking out the first n items?
- # [18:48] <hsivonen> Philip`: "picked n entries from list foo at random" will pre-emptively protect against certain comments :-)
- # [18:50] <Philip`> The data still has interesting biases, e.g. www.weather.com/<stuff> comes up 6760 times in dmoz.org's list
- # [18:51] <Philip`> I'll attempt to get around to uploading the code I'm using for this stuff
- # [18:52] <Philip`> (It only took 15 minutes to collect the data about 8192 pages, so it should be easy enough for other people to do the same)
- # [18:52] <hsivonen> Philip`: are you subscribed to public-html yet?
- # [18:53] <zcorpan_> 56.7% don't have a doctype
- # [18:53] <hsivonen> speaking of doctype, the DOM API design around doctypes just sucks
- # [18:54] <hsivonen> it sucks so much that I'm leaving doctype support out of my DOM tree builder impl
- # [18:54] <Philip`> hsivonen: Not yet, since I was lazy for a while and didn't have anything interesting to say, and then I thought I might as well join anyway so now I'm just waiting for the application to get handled
- # [18:54] * zcorpan_ wonders what doctype dom apis are good for
- # [18:54] * Philip` should probably work out how to cache the front page of his results page
- # [18:55] <hsivonen> zcorpan_: nothing that isn't harmful, as far as I can tell
- # [18:56] <hsivonen> the main reason for supporting doctypes in the native tree API of my parser (I call it SAX Tree) is running html5lib test cases
- # [18:56] <hsivonen> I indend to turn doctype nodes off by default
- # [18:56] <hsivonen> so that hopefully fewer people shoot themselves in the foot with them
- # [18:56] <hsivonen> intend even
- # [18:56] <Philip`> <td headers> is on 4 pages, <td scope> on 14, <th scope> on 45
- # [18:57] <hsivonen> Philip`: any signs of an authoring tool besides a text editor being used for those pages?
- # [18:57] <Philip`> Three of those four with <td headers> are census.gov
- # [18:59] <Philip`> http://www.tppinternet.com/ puts scope="row" all over its layout tables
- # [19:00] <Philip`> (http://canvex.lazyilluminati.com/misc/stats/2/analyse.cgi/attr/scope has a list of relevant sites)
- # [19:00] <Philip`> (It only shows the top 20 - would it be worth expanding that list?)
- # [19:03] <Philip`> hsivonen: http://www.calicorestaurant.com/ and http://www.innodev.fi/ have some "<!-- InstanceBegin template ..." stuff that looks like a tool was involved (putting scope onto what looks like just layout tables)
- # [19:04] * Joins: Lionheart (robin@198.86.248.1)
- # [19:04] <Philip`> http://www.harneydh.com/ has some <!--DWLayoutTable--> - Dreamweaver?
- # [19:05] <Philip`> Those seem to be examples of accidental scope usage
- # [19:07] <hsivonen> kind of sad if tools put scope on layout tables
- # [19:08] <hsivonen> BTW, tree-buffered SAX without XML 1.0 compat options is now runnable and perhaps even usable in the whattf svn
- # [19:11] <Philip`> Most of the legitimate @scope I can see is on calendars
- # [19:23] <Philip`> http://members.aol.com/westshoretheatre/ - <!doctype html public "-//"AOL Hometown//html 3.0 transitional//en">, a few pages down after several tables and scripts - I don't think they've quite got the hang of this
- # [19:25] <Philip`> (That would put IE in standards mode (if it was actually at the top of the document), but HTML5/etc goes into quirks mode)
- # [19:29] <Philip`> http://www.magneticsforyou.com/ - that site doesn't work at all well in Opera :-(
- # [19:33] * Quits: schepers (schepers@128.30.52.30) (Client exited)
- # [19:34] * Joins: schepers (schepers@128.30.52.30)
- # [19:54] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
- # [19:59] * Joins: gavin (gavin@74.103.208.221)
- # [20:04] * Quits: tH (Rob@87.102.76.26) (Ping timeout)
- # [20:12] * Joins: hasather (hasather@81.235.209.174)
- # [20:53] * Joins: tH (Rob@87.102.76.26)
- # [21:07] * Quits: Lionheart (robin@198.86.248.1) (Ping timeout)
- # [21:11] <zcorpan_> Philip`: in your sample, 3.5% have duplicate style attributes
- # [21:11] <zcorpan_> that's pretty much
- # [21:13] <Philip`> zcorpan_: Shouldn't that be 0.35%?
- # [21:13] <Philip`> (27 out of 7739)
- # [21:15] <Philip`> (Incidentally, I need to fix my tables so they say the percentage of pages which have some feature - the current way is quite misleading...)
- # [21:28] <zcorpan_> Philip`: ah, yes.
- # [21:29] <zcorpan_> still pretty high
- # [21:29] <zcorpan_> and yes, percentages are more useful than numbers :)
- # [21:31] <zcorpan_> 0.19% with <image> tags
- # [21:32] <zcorpan_> "As of 2005-12, studies showed that around 0.2% of pages used the <image> element."
- # [21:34] <Philip`> "0.19%" is a bit optimistic in terms of the number of significant figures, given the sample size :-)
- # [21:35] <Philip`> http://www.imdb.com/ - point people there if you want to show them why web browsers have to support <image>
- # [21:35] <zcorpan_> ~ 0.2%
- # [21:35] <zcorpan_> which is the same as what Hixie got
- # [21:36] <Philip`> I should probably try to find the margin of error on these numbers, but that sounds too much like hard work
- # [21:59] * Joins: dbaron (dbaron@63.245.220.241)
- # [22:01] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
- # [22:06] * Joins: gavin (gavin@74.103.208.221)
- # [22:27] * Joins: hyatt (hyatt@17.203.15.144)
- # [22:39] <hsivonen> I wonder why some people on the list are so keen on forcing their source aesthetics on other people
- # [22:42] * zcorpan_ too
- # [22:46] <Philip`> http://www.city-data.com/city/Hardy-Iowa.html - ooh, a <canvas>
- # [22:46] <Philip`> via PlotKit
- # [22:51] * Quits: ROBOd (robod@86.34.246.154) (Quit: http://www.robodesign.ro )
- # [22:55] <jgraham> html5lib now passes all of its own testcases again!
- # [22:55] <jgraham> (this hasn't been true for some days)
- # [22:55] <jgraham> TODO:
- # [22:55] <jgraham> New character encoding detection stuff
- # [22:56] <jgraham> Make performance suck less (I suspect without testing that we regressed by a factor of ~2 when the input stream got rewritten)
- # [22:56] <jgraham> Make a release
- # [22:56] <jgraham> Not necessarily in that order
- # [22:57] * jgraham also doesn't see the value in long discussions about source formatting on the list
- # [22:57] <Philip`> To fix performance, you should do a cHTMLTokenizer and improve by ~2 orders of magnitude ;-)
- # [22:57] * Quits: dbaron (dbaron@63.245.220.241) (Quit: 8403864 bytes have been tenured, next gc will be global.)
- # [22:58] <jgraham> Philip`: Then we'd just move the bottleneck somewhere else
- # [22:59] <jgraham> I think with careful profiling we could maybe improve by a factor 5 over the current perf but I'm not sure we can do much better without a full rewrite
- # [23:00] <Philip`> Incidentally, I saw comments in the html5lib code about finding the frequency of each case so they can be ordered better - have you seen http://canvex.lazyilluminati.com/misc/stats/tokeniser.html ?
- # [23:00] <jgraham> (Maybe even a factor 5 is wildly optimistic)
- # [23:00] <jgraham> (and I think it would require more changes than I think are good)
- # [23:01] <jgraham> Philip`: Yeah. Maybe Anne will want to work on that
- # [23:03] * Parts: hasather (hasather@81.235.209.174)
- # [23:04] * Joins: hasather (hasather@81.235.209.174)
- # [23:05] <jgraham> (Oh and the stats are cool. Are you planning to implement the treebuilder?)
- # [23:09] <Philip`> (I am planning that, though by 'planning' I just mean I think it'd probably be a good thing to do, and not that I've done any actual planning or have any idea of what's involved or when I'll find time to do it)
- # [23:11] <Philip`> (But I do like the transform-OCaml-into-C++-(or-JS-or-etc) approach, so I'd do the tree builder like that too)
- # [23:40] * Joins: dbaron (dbaron@63.245.220.241)
- # [23:41] <zcorpan_> Philip`: from your stats: quirks: 83%, limited quirks: 19%, no quirks: 3%
- # [23:44] <zcorpan_> Philip`: which is 105% in total, so some pages must have more than 1 doctype
- # [23:45] <Philip`> zcorpan_: Oops, looks like "None" includes the pages that weren't successfully downloaded
- # [23:45] <zcorpan_> ah
- # [23:45] <Philip`> Multiply everything by 7739/8192
- # [23:46] <Philip`> and then ignore ~1% error since I was only listing the top 100 doctypes, and there were 162 unique ones in total
- # [23:46] * Quits: xover (xover@193.157.66.5) (Ping timeout)
- # [23:47] <Philip`> Oh, and it seems 14 pages did have multiple doctypes
- # [23:47] <zcorpan_> not just subtract 453 from None?
- # [23:47] <Philip`> But then these numbers are already a bit inaccurate since they don't care whether the doctype was the first token
- # [23:47] <zcorpan_> indeed
- # [23:48] * Joins: xover (xover@193.157.66.5)
- # [23:49] <zcorpan_> quirks: 77%, limited quirks: 19%, no quirks: 3%
- # [23:49] <zcorpan_> a bit different from what i expected
- # [23:50] <zcorpan_> (which was 90%, 9%, 1%)
- # [23:50] <Philip`> Oops, yes, subtract from None - that was calculated as 8192 - (number of pages with >= 1 doctype)
- # [23:50] <zcorpan_> ok
- # [23:51] <zcorpan_> movie time now
- # [23:53] <Philip`> Fixed the script so it calculates 'none' more correctly now
- # [23:54] * Quits: zcorpan_ (zcorpan@90.229.146.10) (Ping timeout)
- # [23:57] * Quits: dbaron (dbaron@63.245.220.241) (Quit: 8403864 bytes have been tenured, next gc will be global.)
- # Session Close: Tue Jul 17 00:00:00 2007
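Note on the 14:33-14:36 exchange: zcorpan_'s point is that the two boolean forms only differ when the escape flag is true in the PCDATA state, and that state cannot occur. The following is a minimal sketch that checks this by brute force. The four state names and the function names are assumptions made for illustration, not taken from the spec text or from html5lib.

```python
from itertools import product

# Assumed set of content model flag values; only PCDATA and RCDATA matter here.
MODELS = ("PCDATA", "RCDATA", "CDATA", "PLAINTEXT")


def reachable_states():
    """Yield (model, escape_flag) pairs, skipping the state zcorpan_ rules out:
    the escape flag being true while in PCDATA."""
    for model, escape in product(MODELS, (False, True)):
        if model == "PCDATA" and escape:
            continue
        yield model, escape


def form_a(model, escape):
    # ((pcdata || rcdata) && !escape_flag)
    return model in ("PCDATA", "RCDATA") and not escape


def form_b(model, escape):
    # (pcdata || (rcdata && !escape_flag))
    return model == "PCDATA" or (model == "RCDATA" and not escape)


# Collect every reachable state on which the two forms disagree.
mismatches = [s for s in reachable_states() if form_a(*s) != form_b(*s)]
print(mismatches)
```

The list of mismatches is empty for the reachable states. The two forms do disagree on ("PCDATA", True), which is exactly the combination excluded above.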
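Note on the 21:11-21:36 exchange: Philip` corrects 3.5% to 0.35% (27 out of 7739) and mentions wanting a margin of error. The following is a rough sketch of both calculations. It uses a Wilson score interval at roughly 95% confidence, which is a choice made here for illustration and not something anyone in the channel proposed. It also treats the pages as an independent random sample, which the dmoz.org biases discussed at 18:50 make only approximately true.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% confidence."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Figures quoted in the log: 27 pages with duplicate style attributes
# out of 7739 successfully downloaded pages.
pages_with_feature = 27
pages_downloaded = 7739

proportion = pages_with_feature / pages_downloaded
low, high = wilson_interval(pages_with_feature, pages_downloaded)

print(f"share of pages: {100 * proportion:.2f}%")
print(f"approximate 95% interval: {100 * low:.2f}% to {100 * high:.2f}%")
```

With counts this small the interval is wide relative to the estimate, which fits the remark that quoting "0.19%" to two significant figures is optimistic for this sample size.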