  1. # Session Start: Mon Jul 16 00:00:00 2007
  2. # Session Ident: #html-wg
  3. # [00:00] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
  4. # [00:05] * Quits: deltab (deltab@82.36.30.34) (Client exited)
  5. # [00:05] * Joins: gavin (gavin@74.103.208.221)
  6. # [00:06] * Joins: deltab (deltab@82.36.30.34)
  7. # [00:15] <Philip`> Aha, a good use of educational resources...
  8. # [00:16] <Philip`> http://people.pwf.cam.ac.uk/pjt47/html/dmoz-unique-pages.txt.gz (~30MB) has dmoz.org's 4.5M URLs, with duplicates removed, in case somebody wants that list without downloading the ~300MB of RDF data
  9. # [00:18] * Quits: heycam (cam@203.214.127.179) (Ping timeout)
  10. # [00:54] * Quits: Lachy (chatzilla@203.214.140.60) (Quit: ChatZilla 0.9.78.1 [Firefox 2.0.0.4/2007051502])
  11. # [00:55] * Quits: bogi (bogi@153.19.120.250) (Ping timeout)
  12. # [01:07] * Quits: tH (Rob@87.102.36.227) (Quit: ChatZilla 0.9.78.1-rdmsoft [XULRunner 1.8.0.9/2006120508])
  13. # [01:14] * Joins: schepers (schepers@128.30.52.30)
  14. # [01:18] * Joins: heycam (cam@130.194.72.84)
  15. # [01:25] * Quits: heycam (cam@130.194.72.84) (Quit: bye)
  16. # [01:25] * Joins: heycam (cam@130.194.72.84)
  17. # [01:48] * Joins: Lachy (chatzilla@203.214.140.60)
  18. # [02:08] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
  19. # [02:13] * Joins: gavin (gavin@74.103.208.221)
  20. # [02:22] * Quits: Sander (svl@86.87.68.167) (Quit: And back he spurred like a madman, shrieking a curse to the sky.)
  21. # [02:46] * Quits: Lionheart (robin@66.57.69.65) (Ping timeout)
  22. # [04:07] * Philip` can download and collect statistics about web pages at a rate of about 5 per second on a single machine, which doesn't seem too bad
  23. # [04:15] <Philip`> http://www.sebascos.dk/ - by far the winner in the number-of-<head>s-on-one-page contest; plus it's got cats
  24. # [04:16] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
  25. # [04:21] * Joins: gavin (gavin@74.103.208.221)
  26. # [06:24] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
  27. # [06:29] * Joins: gavin (gavin@74.103.208.221)
  28. # [07:54] * Quits: mjs (mjs@64.81.48.145) (Quit: mjs)
  29. # [07:55] * Joins: mjs (mjs@64.81.48.145)
  30. # [08:12] * Quits: xover (xover@193.157.66.5) (Ping timeout)
  31. # [08:31] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
  32. # [08:32] * Joins: MikeSmith (MikeSmith@mcclure.w3.org)
  33. # [08:36] * Joins: gavin (gavin@74.103.208.221)
  34. # [08:37] * Quits: sbuluf (fgwg@200.49.140.174) (Ping timeout)
  35. # [08:51] * Quits: MikeSmith (MikeSmith@mcclure.w3.org) (Quit: Less talk, more pimp walk.)
  36. # [08:52] * Joins: xover (xover@193.157.66.5)
  37. # [09:02] * Quits: schepers (schepers@128.30.52.30) (Client exited)
  38. # [09:30] * Joins: Zeros (Zeros-Elip@67.154.87.254)
  39. # [09:33] * Quits: heycam (cam@130.194.72.84) (Quit: bye)
  40. # [10:01] * Joins: bogi (bogi@153.19.120.250)
  41. # [10:12] <hsivonen> Philip`: I wonder how those <head>s ended up there
  42. # [10:13] * Joins: MikeSmith (MikeSmith@mcclure.w3.org)
  43. # [10:38] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
  44. # [10:43] * Joins: gavin (gavin@74.103.208.221)
  45. # [10:47] * Quits: Zeros (Zeros-Elip@67.154.87.254) (Quit: Leaving)
  46. # [10:54] * Joins: heycam (cam@203.214.127.179)
  47. # [11:04] * Joins: ROBOd (robod@86.34.246.154)
  48. # [11:24] * Joins: Lionheart (robin@66.57.69.65)
  49. # [12:09] * Quits: beowulf (carisenda@91.84.50.132) (Ping timeout)
  50. # [12:30] * Joins: tH (Rob@87.102.36.227)
  51. # [12:47] * Quits: Lionheart (robin@66.57.69.65) (Ping timeout)
  52. # [13:01] * Joins: zcorpan_ (zcorpan@90.229.146.10)
  53. # [13:31] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
  54. # [13:37] * Joins: gavin (gavin@74.103.208.221)
  55. # [13:39] * Quits: zcorpan_ (zcorpan@90.229.146.10) (Ping timeout)
  56. # [14:01] * Joins: schepers (schepers@128.30.52.30)
  57. # [14:16] <Philip`> http://encarta.msn.com/encyclopedia_761579147/William_I_(of_England).html has lots of <div style="clear:left" />, resulting in unclosed divs - XML seems to cause as much confusion as it solves
  58. # [14:22] * Joins: zcorpan_ (zcorpan@90.229.146.10)
  59. # [14:22] <zcorpan_> hsivonen: http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2007-June/012070.html
  60. # [14:23] <zcorpan_> hsivonen: the html5lib tests are ahead of the spec :)
  61. # [14:26] <Philip`> If you changed that text, you'd have to change the "In the RCDATA and CDATA states, a further escape flag is used to control the behaviour of the tokeniser" too since it'll apply to PCDATA
  62. # [14:28] <Philip`> though I guess it isn't relevant to PCDATA, so it should be more like "When the content model flag is set to the PCDATA state, or when it is set to the RCDATA state and the escape flag is false, ...", perhaps
  63. # [14:33] <zcorpan_> the escape flag can't be true in the pcdata state
  64. # [14:36] <zcorpan_> so ((pcdata || rcdata) && !escape_flag) is the same as (pcdata || (rcdata && !escape_flag))
  65. # [14:40] <hsivonen> zcorpan_: ok.
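
A quick way to check zcorpan_'s equivalence is to enumerate the flag combinations. This is only a sketch: the function names are made up for illustration, and it assumes (as stated above) that the escape flag is never true in the PCDATA state and that the content model flag holds one state at a time.

```python
from itertools import product


def grouped_outer(pcdata, rcdata, escape_flag):
    # ((pcdata || rcdata) && !escape_flag)
    return (pcdata or rcdata) and not escape_flag


def grouped_inner(pcdata, rcdata, escape_flag):
    # (pcdata || (rcdata && !escape_flag))
    return pcdata or (rcdata and not escape_flag)


def reachable_states():
    """Yield (pcdata, rcdata, escape_flag) tuples the tokeniser can be in.

    Assumptions: the content model flag is in at most one of the two
    states, and the escape flag cannot be true in the PCDATA state.
    """
    for pcdata, rcdata, escape_flag in product([False, True], repeat=3):
        if pcdata and rcdata:
            continue  # one content model state at a time
        if pcdata and escape_flag:
            continue  # escape flag can't be true in PCDATA
        yield pcdata, rcdata, escape_flag


# Every reachable state where the two expressions disagree (expected: none).
mismatches = [s for s in reachable_states()
              if grouped_outer(*s) != grouped_inner(*s)]
print(mismatches)
```

The two expressions only differ for pcdata with the escape flag set, which is exactly the combination ruled out above.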
  66. # [14:42] <hsivonen> jgraham: it would be useful for me and presumably for anyone else writing a streaming parser if test cases with non-streamable error recovery were in separate .dat files
  67. # [14:43] <hsivonen> jgraham: is it OK to move stuff around so that each .dat either contains non-streamable cases or streamable cases?
  68. # [15:11] * Joins: edas (edaspet@88.191.34.123)
  69. # [15:18] * Joins: gorme (gorm@213.236.208.22)
  70. # [15:38] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
  71. # [15:43] * Joins: gavin (gavin@74.103.208.221)
  72. # [15:45] * Quits: MikeSmith (MikeSmith@mcclure.w3.org) (Client exited)
  73. # [15:54] * Joins: MikeSmith (MikeSmith@mcclure.w3.org)
  74. # [16:21] * Joins: billmason (billmason@69.30.57.156)
  75. # [16:25] * Joins: tH_ (Rob@87.102.76.26)
  76. # [16:27] * Quits: tH (Rob@87.102.36.227) (Ping timeout)
  77. # [16:27] * tH_ is now known as tH
  78. # [16:37] * Quits: edas (edaspet@88.191.34.123) (Ping timeout)
  79. # [16:47] * Joins: edas (edaspet@88.191.34.123)
  80. # [17:12] * Joins: kazuhito (kazuhito@222.151.186.182)
  81. # [17:31] * Joins: Lionheart (robin@198.86.248.1)
  82. # [17:46] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
  83. # [17:48] * Quits: edas (edaspet@88.191.34.123) (Ping timeout)
  84. # [17:51] * Joins: gavin (gavin@74.103.208.221)
  85. # [17:57] * Quits: kazuhito (kazuhito@222.151.186.182) (Quit: Quitting!)
  86. # [18:05] * Quits: Lionheart (robin@198.86.248.1) (Ping timeout)
  87. # [18:09] * Joins: Sander (svl@86.87.68.167)
  88. # [18:42] * Quits: MikeSmith (MikeSmith@mcclure.w3.org) (Quit: Less talk, more pimp walk.)
  89. # [18:45] <Philip`> http://canvex.lazyilluminati.com/misc/stats/2/analyse.cgi/index
  90. # [18:45] <Philip`> now with 8192 pages
  91. # [18:46] <Philip`> and with not especially great scalability, so it's starting to go a bit slowly :-(
  92. # [18:46] <Philip`> (mainly since it stores all the details for each page, rather than just aggregate statistics)
  93. # [18:47] <zcorpan_> Philip`: you may want to be careful with the usage of the phrase "random sample"
  94. # [18:47] <hsivonen> Philip`: the frequency of "td" suggests to me that shunning layout tables is tilting at windmills and doesn't serve the needs of authors
  95. # [18:48] <Philip`> How careful do I have to be about 'random sample' when there's a well-defined list, and I'm just shuffling the whole list then picking out the first n items?
  96. # [18:48] <hsivonen> Philip`: "picked n entries from list foo at random" will pre-emptively protect against certain comments :-)
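
The "shuffle the whole list, then take the first n items" approach Philip` describes might look like the sketch below. The function name, the seed parameter and the example.org URLs are assumptions for illustration; this is not the code he actually used.

```python
import random


def pick_at_random(urls, n, seed=None):
    """Return n entries picked from urls at random, without replacement."""
    rng = random.Random(seed)  # a seed makes the pick reproducible
    shuffled = list(urls)      # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    return shuffled[:n]


# Hypothetical stand-in for the dmoz.org URL list.
urls = [f"http://example.org/page{i}" for i in range(100)]
sample = pick_at_random(urls, 8, seed=2007)
print(sample)
```

Every entry of the list is equally likely to be picked, but the result is only as representative as the list itself.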
  97. # [18:50] <Philip`> The data still has interesting biases, e.g. www.weather.com/<stuff> comes up 6760 times in dmoz.org's list
  98. # [18:51] <Philip`> I'll attempt to get around to uploading the code I'm using for this stuff
  99. # [18:52] <Philip`> (It only took 15 minutes to collect the data about 8192 pages, so it should be easy enough for other people to do the same)
  100. # [18:52] <hsivonen> Philip`: are you subscribed to public-html yet?
  101. # [18:53] <zcorpan_> 56.7% don't have a doctype
  102. # [18:53] <hsivonen> speaking of doctype, the DOM API design around doctypes just sucks
  103. # [18:54] <hsivonen> it sucks so much that I'm leaving doctype support out of my DOM tree builder impl
  104. # [18:54] <Philip`> hsivonen: Not yet, since I was lazy for a while and didn't have anything interesting to say, and then I thought I might as well join anyway so now I'm just waiting for the application to get handled
  105. # [18:54] * zcorpan_ wonders what doctype dom apis are good for
  106. # [18:54] * Philip` should probably work out how to cache the front page of his results page
  107. # [18:55] <hsivonen> zcorpan_: nothing that isn't harmful, as far as I can tell
  108. # [18:56] <hsivonen> the main reason for supporting doctypes in the native tree API of my parser (I call it SAX Tree) is running html5lib test cases
  109. # [18:56] <hsivonen> I indend to turn doctype nodes off by default
  110. # [18:56] <hsivonen> so that hopefully fewer people shoot themselves in the foot with them
  111. # [18:56] <hsivonen> intend even
  112. # [18:56] <Philip`> <td headers> is on 4 pages, <td scope> on 14, <th scope> on 45
  113. # [18:57] <hsivonen> Philip`: any signs of an authoring tool besides a text editor being used for those pages?
  114. # [18:57] <Philip`> Three of those four with <td headers> are census.gov
  115. # [18:59] <Philip`> http://www.tppinternet.com/ puts scope="row" all over its layout tables
  116. # [19:00] <Philip`> (http://canvex.lazyilluminati.com/misc/stats/2/analyse.cgi/attr/scope has a list of relevant sites)
  117. # [19:00] <Philip`> (It only shows the top 20 - would it be worth expanding that list?)
  118. # [19:03] <Philip`> hsivonen: http://www.calicorestaurant.com/ and http://www.innodev.fi/ have some "<!-- InstanceBegin template ..." stuff that looks like a tool was involved (putting scope onto what looks like just layout tables)
  119. # [19:04] * Joins: Lionheart (robin@198.86.248.1)
  120. # [19:04] <Philip`> http://www.harneydh.com/ has some <!--DWLayoutTable--> - Dreamweaver?
  121. # [19:05] <Philip`> Those seem to be examples of accidental scope usage
  122. # [19:07] <hsivonen> kind of sad if tools put scope on layout tables
  123. # [19:08] <hsivonen> BTW, tree-buffered SAX without XML 1.0 compat options is now runnable and perhaps even usable in the whattf svn
  124. # [19:11] <Philip`> Most of the legitimate @scope I can see is on calendars
  125. # [19:23] <Philip`> http://members.aol.com/westshoretheatre/ - <!doctype html public "-//"AOL Hometown//html 3.0 transitional//en">, a few pages down after several tables and scripts - I don't think they've quite got the hang of this
  126. # [19:25] <Philip`> (That would put IE in standards mode (if it was actually at the top of the document), but HTML5/etc goes into quirks mode)
  127. # [19:29] <Philip`> http://www.magneticsforyou.com/ - that site doesn't work at all well in Opera :-(
  128. # [19:33] * Quits: schepers (schepers@128.30.52.30) (Client exited)
  129. # [19:34] * Joins: schepers (schepers@128.30.52.30)
  130. # [19:54] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
  131. # [19:59] * Joins: gavin (gavin@74.103.208.221)
  132. # [20:04] * Quits: tH (Rob@87.102.76.26) (Ping timeout)
  133. # [20:12] * Joins: hasather (hasather@81.235.209.174)
  134. # [20:53] * Joins: tH (Rob@87.102.76.26)
  135. # [21:07] * Quits: Lionheart (robin@198.86.248.1) (Ping timeout)
  136. # [21:11] <zcorpan_> Philip`: in your sample, 3.5% have duplicate style attributes
  137. # [21:11] <zcorpan_> that's quite a lot
  138. # [21:13] <Philip`> zcorpan_: Shouldn't that be 0.35%?
  139. # [21:13] <Philip`> (27 out of 7739)
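
The arithmetic behind the correction, as a small sketch (the helper name is made up; the two figures are the ones quoted above):

```python
def percent_of_pages(count, total):
    """Share of pages with a feature, as a percentage of all pages."""
    return 100.0 * count / total


# 27 pages with duplicate style attributes out of 7739 downloaded pages.
pct = percent_of_pages(27, 7739)
print(f"{pct:.2f}%")
```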
  140. # [21:15] <Philip`> (Incidentally, I need to fix my tables so they say the percentage of pages which have some feature - the current way is quite misleading...)
  141. # [21:28] <zcorpan_> Philip`: ah, yes.
  142. # [21:29] <zcorpan_> still pretty high
  143. # [21:29] <zcorpan_> and yes, percentages are more useful than numbers :)
  144. # [21:31] <zcorpan_> 0.19% with <image> tags
  145. # [21:32] <zcorpan_> "As of 2005-12, studies showed that around 0.2% of pages used the <image> element."
  146. # [21:34] <Philip`> "0.19%" is a bit optimistic in terms of the number of significant figures, given the sample size :-)
  147. # [21:35] <Philip`> http://www.imdb.com/ - point people there if you want to show them why web browsers have to support <image>
  148. # [21:35] <zcorpan_> ~ 0.2%
  149. # [21:35] <zcorpan_> which is the same as what Hixie got
  150. # [21:36] <Philip`> I should probably try to find the margin of error on these numbers, but that sounds too much like hard work
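
A rough margin of error is not much work if a simple model is acceptable. The sketch below uses the normal approximation for a sample proportion at about 95% confidence. It assumes the 0.19% figure is out of the 7739 successfully downloaded pages and that the sample behaves like a simple random one; for proportions this small a Wilson interval would be a more careful choice.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Half-width of the normal-approximation interval for a proportion.

    p: observed proportion (0..1), n: sample size, z: 1.96 for ~95%.
    """
    return z * math.sqrt(p * (1.0 - p) / n)


p = 0.0019  # 0.19% of pages with <image> tags
n = 7739    # assumed sample size: the successfully downloaded pages
moe = margin_of_error(p, n)
print(f"{100 * p:.2f}% +/- {100 * moe:.2f} percentage points")
```

Under those assumptions the margin comes out at roughly a tenth of a percentage point, which supports quoting the result as "~0.2%" rather than "0.19%".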
  151. # [21:59] * Joins: dbaron (dbaron@63.245.220.241)
  152. # [22:01] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
  153. # [22:06] * Joins: gavin (gavin@74.103.208.221)
  154. # [22:27] * Joins: hyatt (hyatt@17.203.15.144)
  155. # [22:39] <hsivonen> I wonder why some people on the list are so keen on forcing their source aesthetics on other people
  156. # [22:42] * zcorpan_ too
  157. # [22:46] <Philip`> http://www.city-data.com/city/Hardy-Iowa.html - ooh, a <canvas>
  158. # [22:46] <Philip`> via PlotKit
  159. # [22:51] * Quits: ROBOd (robod@86.34.246.154) (Quit: http://www.robodesign.ro )
  160. # [22:55] <jgraham> html5lib now passes all of its own testcases again!
  161. # [22:55] <jgraham> (this hasn't been true for some days)
  162. # [22:55] <jgraham> TODO:
  163. # [22:55] <jgraham> New character encoding detection stuff
  164. # [22:56] <jgraham> Make performance suck less (I suspect without testing that we regressed by a factor of ~2 when the input stream got rewritten)
  165. # [22:56] <jgraham> Make a release
  166. # [22:56] <jgraham> Not necessarily in that order
  167. # [22:57] * jgraham also doesn't see the value in long discussions about source formatting on the list
  168. # [22:57] <Philip`> To fix performance, you should do a cHTMLTokenizer and improve by ~2 orders of magnitude ;-)
  169. # [22:57] * Quits: dbaron (dbaron@63.245.220.241) (Quit: 8403864 bytes have been tenured, next gc will be global.)
  170. # [22:58] <jgraham> Philip`: Then we'd just move the bottleneck somewhere else
  171. # [22:59] <jgraham> I think with careful profiling we could maybe improve by a factor 5 over the current perf but I'm not sure we can do much better without a full rewrite
  172. # [23:00] <Philip`> Incidentally, I saw comments in the html5lib code about finding the frequency of each case so they can be ordered better - have you seen http://canvex.lazyilluminati.com/misc/stats/tokeniser.html ?
  173. # [23:00] <jgraham> (Maybe even a factor 5 is wildly optimistic)
  174. # [23:00] <jgraham> (and I think it would require more changes than I think are good)
  175. # [23:01] <jgraham> Philip`: Yeah. Maybe Anne will want to work on that
  176. # [23:03] * Parts: hasather (hasather@81.235.209.174)
  177. # [23:04] * Joins: hasather (hasather@81.235.209.174)
  178. # [23:05] <jgraham> (Oh and the stats are cool. Are you planning to implement the treebuilder?)
  179. # [23:09] <Philip`> (I am planning that, though by 'planning' I just mean I think it'd probably be a good thing to do, and not that I've done any actual planning or have any idea of what's involved or when I'll find time to do it)
  180. # [23:11] <Philip`> (But I do like the transform-OCaml-into-C++-(or-JS-or-etc) approach, so I'd do the tree builder like that too)
  181. # [23:40] * Joins: dbaron (dbaron@63.245.220.241)
  182. # [23:41] <zcorpan_> Philip`: from your stats: quirks: 83%, limited quirks: 19%, no quirks: 3%
  183. # [23:44] <zcorpan_> Philip`: which is 105% in total, so some pages must have more than 1 doctype
  184. # [23:45] <Philip`> zcorpan_: Oops, looks like "None" includes the pages that weren't successfully downloaded
  185. # [23:45] <zcorpan_> ah
  186. # [23:45] <Philip`> Multiply everything by 7739/8192
  187. # [23:46] <Philip`> and then ignore ~1% error since I was only listing the top 100 doctypes, and there were 162 unique ones in total
  188. # [23:46] * Quits: xover (xover@193.157.66.5) (Ping timeout)
  189. # [23:47] <Philip`> Oh, and it seems 14 pages did have multiple doctypes
  190. # [23:47] <zcorpan_> not just subtract 453 from None?
  191. # [23:47] <Philip`> But then these numbers are already a bit inaccurate since they don't care whether the doctype was the first token
  192. # [23:47] <zcorpan_> indeed
  193. # [23:48] * Joins: xover (xover@193.157.66.5)
  194. # [23:49] <zcorpan_> quirks: 77%, limited quirks: 19%, no quirks: 3%
  195. # [23:49] <zcorpan_> a bit different from what i expected
  196. # [23:50] <zcorpan_> (which was 90%, 9%, 1%)
  197. # [23:50] <Philip`> Oops, yes, subtract from None - that was calculated as 8192 - (number of pages with >= 1 doctype)
  198. # [23:50] <zcorpan_> ok
  199. # [23:51] <zcorpan_> movie time now
  200. # [23:53] <Philip`> Fixed the script so it calculates 'none' more correctly now
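
The "None" miscount discussed above can be sketched as follows. The 8192 and 7739 totals come from the log; the function names and the count of pages with a doctype are hypothetical, chosen only to show the size of the error.

```python
ATTEMPTED = 8192   # pages in the sample
DOWNLOADED = 7739  # pages that were successfully downloaded


def none_count_buggy(pages_with_doctype):
    # Original calculation: failed downloads get counted as "no doctype".
    return ATTEMPTED - pages_with_doctype


def none_count_fixed(pages_with_doctype):
    # Only pages that were actually fetched can be said to lack a doctype.
    return DOWNLOADED - pages_with_doctype


failed = ATTEMPTED - DOWNLOADED    # the 453 that zcorpan_ mentions
pages_with_doctype = 3500          # hypothetical value, not from the stats
overcount = (none_count_buggy(pages_with_doctype)
             - none_count_fixed(pages_with_doctype))
print(failed, overcount)
```

Whatever the real doctype count is, the buggy version overstates "None" by exactly the number of failed downloads.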
  201. # [23:54] * Quits: zcorpan_ (zcorpan@90.229.146.10) (Ping timeout)
  202. # [23:57] * Quits: dbaron (dbaron@63.245.220.241) (Quit: 8403864 bytes have been tenured, next gc will be global.)
  203. # Session Close: Tue Jul 17 00:00:00 2007
