- # Session Start: Sun Jul 15 00:00:00 2007
- # Session Ident: #html-wg
- # [00:00] * Joins: hyatt (hyatt@24.6.91.161)
- # [00:02] <Philip`> I remember looking a long time ago at the pages I found using <footer>, and they just looked far more like old buggy HTML with random made-up tags than like early adopters of HTML5 :-)
- # [00:04] <zcorpan_> ok. was the usage of <footer> incompatible with html5?
- # [00:04] <Philip`> (It would be nice if all the collected statistics could be linked back to the pages they came from - I'll see if I can use that, if it's not going to take huge amounts of disk space...)
- # [00:05] <Philip`> http://www.classesusa.com/schools/campus/it.html
- # [00:06] <Philip`> (The other <footer> was on the same site as that one)
- # [00:07] <Philip`> Oops, the <header> was actually just a </header>, in http://home.comcast.net/~chris.s/myth.html
- # [00:08] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
- # [00:11] * Joins: sbuluf (wzfdycu@200.49.140.174)
- # [00:14] * Joins: gavin (gavin@74.103.208.221)
- # [00:16] * Quits: hyatt (hyatt@24.6.91.161) (Quit: hyatt)
- # [00:16] * Quits: tinfish (tinfish@84.92.181.183) (Quit: tinfish)
- # [00:22] * Quits: zcorpan_ (zcorpan@90.229.146.10) (Ping timeout)
- # [00:31] * Quits: Lachy (chatzilla@203.214.140.60) (Quit: ChatZilla 0.9.78.1 [Firefox 2.0.0.4/2007051502])
- # [00:42] * Joins: mjs (mjs@64.81.48.145)
- # [01:41] * Joins: hyatt (hyatt@24.6.91.161)
- # [01:46] * Joins: myakura (myakura@58.88.37.26)
- # [01:51] * Quits: myakura (myakura@58.88.37.26) (Quit: Leaving...)
- # [01:55] * Quits: hyatt (hyatt@24.6.91.161) (Quit: hyatt)
- # [02:03] * Quits: tH (Rob@87.102.36.227) (Quit: ChatZilla 0.9.78.1-rdmsoft [XULRunner 1.8.0.9/2006120508])
- # [02:04] <Philip`> http://canvex.lazyilluminati.com/misc/stats/analyse.cgi/index - I've replaced the dataset with the Alexa Top 500 pages
- # [02:05] <Philip`> It's interesting to see the prevalence of <script> (on about 93% of pages), compared to http://code.google.com/webstats/2005-12/scripting.html finding it on roughly half
- # [02:16] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
- # [02:21] * Joins: gavin (gavin@74.103.208.221)
- # [02:32] * Quits: Sander (svl@86.87.68.167) (Quit: And back he spurred like a madman, shrieking a curse to the sky.)
- # [03:45] * Quits: mjs (mjs@64.81.48.145) (Quit: mjs)
- # [04:06] * Joins: mjs (mjs@64.81.48.145)
- # [07:10] * Joins: Lachy (chatzilla@203.214.140.60)
- # [11:00] * Joins: Fred (fred@84.6.240.69)
- # [11:00] * Parts: Fred (fred@84.6.240.69)
- # [11:31] * Joins: tH_ (Rob@87.102.36.227)
- # [11:31] * tH_ is now known as tH
- # [12:35] * Joins: zcorpan_ (zcorpan@90.229.146.10)
- # [12:57] * Joins: Sander (svl@86.87.68.167)
- # [13:33] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
- # [13:38] * Joins: gavin (gavin@74.103.208.221)
- # [14:23] * Quits: oedipus (oedipus@71.250.56.243) (Ping timeout)
- # [15:04] * Joins: ROBOd (robod@86.34.246.154)
- # [15:26] * Quits: Sander (svl@86.87.68.167) (Quit: And back he spurred like a madman, shrieking a curse to the sky.)
- # [15:38] * Joins: kazuhito (kazuhito@222.151.186.76)
- # [15:58] * Joins: Sander (svl@86.87.68.167)
- # [15:59] * Joins: edas (edaspet@88.191.34.123)
- # [16:31] * Quits: Sander (svl@86.87.68.167) (Quit: And back he spurred like a madman, shrieking a curse to the sky.)
- # [16:46] * Quits: tH (Rob@87.102.36.227) (Ping timeout)
- # [16:49] * Joins: tH (Rob@87.102.36.227)
- # [17:04] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
- # [17:09] * Joins: gavin (gavin@74.103.208.221)
- # [18:05] * Quits: edas (edaspet@88.191.34.123) (Ping timeout)
- # [19:39] * Quits: kazuhito (kazuhito@222.151.186.76) (Quit: Quitting!)
- # [19:45] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
- # [19:50] * Joins: gavin (gavin@74.103.208.221)
- # [19:54] * Joins: Sander (svl@86.87.68.167)
- # [21:02] * Quits: sbuluf (wzfdycu@200.49.140.174) (Ping timeout)
- # [21:07] * Joins: sbuluf (fgwg@200.49.140.174)
- # [21:27] * Quits: xover (xover@193.157.66.5) (Ping timeout)
- # [21:36] * Quits: Sander (svl@86.87.68.167) (Ping timeout)
- # [21:36] * Joins: xover (xover@193.157.66.5)
- # [21:45] * Joins: Lionheart (robin@66.57.69.65)
- # [21:46] <Philip`> Of the front pages of the top 500 sites, www.w3.org contains 79% of the <acronym>s and 97% of the <abbr>s
- # [21:51] <hsivonen> Philip`: do you have a survey framework that others can run?
- # [21:52] <Philip`> I'm trying to build one up at the moment
- # [21:53] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
- # [21:53] <Philip`> though I've not tried to do anything good about downloading a good sample of pages yet, which is why I'm just testing with that list of 500 sites for now
- # [21:53] <hsivonen> I wonder if the dmoz data dump could be considered a representative sample of pages
- # [21:54] <hsivonen> would it be biased towards old pages and front pages?
- # [21:54] <Philip`> That's what http://triin.net/2006/06/12/Selection_of_pages used
- # [21:55] <Philip`> Would it be biased towards English too?
- # [21:55] <hsivonen> dunno. probably
- # [21:56] <hsivonen> although if one wants to analyze, for example, what fallback content tends to say, one would be better off scraping text that one can actually read and categorize
- # [21:57] * hsivonen notes that dmoz still carries a Netscape copyright notice
- # [21:57] <Philip`> Non-English sites seem to be quite different to English ones - e.g. http://www.xinhuanet.com/ has a thousand <td>s, which seems quite insane, but it's just as important that HTML5 isn't incompatible with those sites
- # [21:58] * Joins: gavin (gavin@74.103.208.221)
- # [21:59] <Philip`> (Of the top 12 <td> abusers in my collection of pages, ign.com is the only English one)
- # [22:00] * hsivonen passes tests4.dat
- # [22:00] <hsivonen> oops. didn't pass after all
- # [22:03] <hsivonen> passing it now
- # [22:46] * Quits: zcorpan_ (zcorpan@90.229.146.10) (Ping timeout)
- # [22:47] * Joins: Sander (svl@86.87.68.167)
- # [23:04] * Quits: ROBOd (robod@86.34.246.154) (Quit: http://www.robodesign.ro )
- # [23:24] <Philip`> hsivonen / html5lib people: http://canvex.lazyilluminati.com/misc/stats/tokeniser.html gives the frequency of each step of the tokeniser algorithm, in case that's interesting for knowing which bits to optimise
- # [23:25] <Philip`> (It records the current state and the C++ code for the first conditional which succeeded, with "true" being the "not yet handled" parts)
- # [23:25] <hsivonen> Philip`: thank you. on the face of it, the frequencies suggest that I should optimize away my additional buffers
- # [23:25] <hsivonen> as they are used in the most common steps
- # [23:40] <Philip`> This download-loads-of-HTML-pages idea would be much easier if I had more than 200MB of free disk space left
- # Session Close: Mon Jul 16 00:00:00 2007
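
The element-prevalence figures quoted in the log (for example, `<script>` appearing on about 93% of the Alexa Top 500 front pages) come from Philip`'s own tooling, which is not shown here. As a rough illustration only, the following sketch computes the same kind of statistic with Python's standard `html.parser`. The names `TagCollector`, `tag_prevalence`, and `sample_pages` are hypothetical, and the sample pages are made up rather than survey data.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the set of start-tag names seen in one document."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        self.tags.add(tag)


def tag_prevalence(pages):
    """Map each tag name to the fraction of pages that use it at least once."""
    counts = {}
    for html in pages:
        collector = TagCollector()
        collector.feed(html)
        collector.close()
        for tag in collector.tags:
            counts[tag] = counts.get(tag, 0) + 1
    return {tag: n / len(pages) for tag, n in counts.items()}


# Hypothetical sample, not real survey data.
sample_pages = [
    "<html><body><script>var x = 1;</script><p>Hi</p></body></html>",
    "<html><body><table><tr><td>a</td><td>b</td></tr></table></body></html>",
]
prevalence = tag_prevalence(sample_pages)
```

Only start tags are counted, so a page containing a stray `</header>` with no matching `<header>` (as in the case mentioned at 00:07) would not register as using that element. Note also that `html.parser` is not an HTML5-conforming parser, so its results on badly broken markup may differ from what a spec-compliant tokeniser would report.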