  1. # Session Start: Wed Jul 04 00:00:00 2007
  2. # Session Ident: #whatwg
  3. # [00:00] * Quits: KevinMarks (i=KevinMar@nat/google/x-2987d34f5000d2a1) ("The computer fell asleep")
  4. # [00:10] * Joins: KevinMarks (i=KevinMar@nat/google/x-ea66512c0f090208)
  5. # [00:12] <zcorpan_> annevk: are there tests on things like </p>, <html></p>, <head></p>, etc, in the html5lib tests?
  6. # [00:13] <zcorpan_> public-html starts to get pretty high traffic again
  7. # [00:24] <Hixie> typical longdesc: http://130.83.47.128/masterfiles/descriptions/logo.txt
  8. # [00:24] <webben> typical of what?
  9. # [00:25] <Hixie> typical of the longdescs that are actually not completely bogus
  10. # [00:25] <Hixie> (that's from http://130.83.47.128/vv/ss/comments/13.205.en.tud)
  11. # [00:25] <Hixie> (the first one on my list of "interesting" uses)
  12. # [00:26] <webben> not a terrible longdesc I suppose
  13. # [00:26] <webben> distinguishing between alternate text and explaining what the image is
  14. # [00:26] <Hixie> <a href="http://www.google.co.jp/">
  15. # [00:26] <Hixie> <img src="http://blog2.fc2.com/2/20century/file/Logo_20s.gif" alt="Google" height="75" width="143" longdesc="http://www.google.co.jp/logos.html" /></a>
  16. # [00:26] <webben> shame they didn't explain what the logo actually depicts
  17. # [00:27] * Hixie bangs head against table
  18. # [00:27] <jgraham> zcorpan_: I can't see any tests for those cases (though I thought anne had checked some in...). If you want to add some I can add you to the html5lib members list
  19. # [00:28] <webben> Hixie: maybe the text is helpful for that one
  20. # [00:28] * webben can't read Japanese
  21. # [00:28] <webben> oh wait, Google can read Japanese
  22. # [00:28] <Philip`> But that logo.txt longdesc is in the wrong language for that page (which I guess could be because the site's developers had no way to actually test longdesc so it fell out of sync with the page contents)...
  23. # [00:28] <Hixie> from that en.tud page, lower down:
  24. # [00:28] <Hixie> <img src="/masterfiles/images/blue10x1.gif" alt="[Abstandhalter]" title="[Abstandhalter]" longdesc="/masterfiles/descriptions/abstandhalter.txt">
  25. # [00:28] <Hixie> guess what the "/masterfiles/descriptions/abstandhalter.txt" file contains
  26. # [00:28] <webben> Philip`: good point
  27. # [00:31] <Hixie> i think i've yet to see an actual useful, value use of longdesc="" in this study
  28. # [00:32] <Hixie> bbl
  29. # [00:32] <webben> Hixie: you should include uses of D-links
  30. # [00:32] <webben> since for a long time D-link was used as a longdesc alternative based on poor support for longdesc
  31. # [00:33] * Joins: weinig (i=weinig@nat/apple/x-a260b6922c3b12a6)
  32. # [00:34] * Quits: weinig_ (i=weinig@nat/apple/x-a4970a9ef18c9aca) (Read error: 104 (Connection reset by peer))
  33. # [00:34] <webben> see also: http://www.w3.org/TR/WCAG10-HTML-TECHS/#long-descriptions
  34. # [00:34] <webben> it would be interesting to know how many links in the wild have a value of D or [D] or similar
  35. # [00:34] <webben> s/value/text content/
  36. # [00:36] * Philip` wants to rewrite his own rubbish survey tool to be slightly less rubbish, so he can get vaguely interesting numbers about common features
  37. # [00:37] <webben> how many links ... and what they point to, of course
  38. # [00:37] * jgraham wants a google-scale cluster to run a survey on
  39. # [00:38] <jgraham> and a pony, of course
  40. # [00:39] <jgraham> But seriously, Philip`, it would be nice if your survey tool was more widely available. It would be even better if the parser was fast. I wonder if any of the HTML5-parser-in-C projects are going to produce something soon?
  41. # [00:40] <Philip`> At least my initial version taught me that SQLite is completely rubbish when you have concurrency - it kept throwing exceptions because the whole database was locked
  42. # [00:40] <Philip`> so I need to rewrite it with MySQL or something
  43. # [00:42] * Quits: the_mart (n=Martin@host86-135-9-158.range86-135.btcentralplus.com) ("Leaving")
  44. # [00:42] <Philip`> and I think it should do some simple crawling, rather than only looking at a fixed list of URLs, so it can find more stuff to look at
  45. # [00:43] <Philip`> (and a faster parser would definitely be useful :-) )
  46. # [00:44] * Joins: csarven (n=nevrasc@modemcable081.152-201-24.mc.videotron.ca)
  47. # [00:45] <Philip`> (A Java one would probably be as good as a C one)
  48. # [00:47] <bewest> sounds like a bunch of people are interested in some kind of survey tool available to the community
  49. # [00:48] <webben> Here's a good example of longdesc-as-long-alternative: http://www.fhwa.dot.gov/hfl/framework/04.cfm referring to http://www.fhwa.dot.gov/hfl/framework/longdesc.cfm#fig1
  50. # [00:48] <bewest> purpose would be 2-fold, correct? 1.) survey usage of authoring techniques on the web. 2.) test parsers?
  51. # [00:49] <Philip`> 3.) Confirm whether Hixie's stats are reasonable, or if he's just making up all the numbers :-)
  52. # [00:50] <bewest> I've thought about doing this with ec2 and Alexa's web services
  53. # [00:50] <bewest> eg greptheweb, and MSR
  54. # [00:50] <bewest> alexa has crawled documents in s3
  55. # [00:51] <bewest> but that costs money
  56. # [00:52] <zcorpan_> jgraham: sure. i might check in this browser port too
  57. # [00:53] <zcorpan_> othermaciej: rewrote the function to not be recursive but still get the same error in safari
  58. # [00:53] <bewest> Philip`: so you already have some kind of survey tool? how does it work?
  59. # [00:54] <Philip`> bewest: Ah, I wasn't aware of those things, though I tend to never consider anything that requires money :-)
  60. # [00:55] <bewest> yeah...
  61. # [00:55] <bewest> usually I don't either
  62. # [00:55] <bewest> except that I work at the company that makes those services
  63. # [00:55] <Philip`> It was just something simple for things like http://canvex.lazyilluminati.com/misc/copyright.html and http://canvex.lazyilluminati.com/misc/summary.html
  64. # [00:56] <Philip`> (and a few other things which I can't remember where I put)
  65. # [00:56] <Philip`> where I give it a list of a few thousand URLs (from Yahoo search results for arbitrary terms), and it just downloads them then parses them (with html5lib) and looks for certain stuff
  66. # [00:57] <Philip`> (and sort of does those things in parallel, if you run lots of copies of the program, except most of the processes keep dying because SQLite gets unhappy)
  67. # [00:58] <Philip`> (and then some pages cause quadratic behaviour in html5lib and you have to manually delete them from the database)
  68. # [00:58] <Philip`> (so it's all just horribly hacked together :-p )
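
Editorial aside on the locking errors Philip` describes above. This is a minimal sketch, assuming Python's standard `sqlite3` module, of making each worker wait for a competing writer instead of failing at once. The `pages` table and the helpers `open_db`, `record_page`, and `count_pages` are hypothetical and are not taken from his tool.

```python
import sqlite3


def open_db(path, wait_seconds=30.0):
    # timeout= makes this connection wait up to wait_seconds for another
    # writer's lock before raising "database is locked".
    conn = sqlite3.connect(path, timeout=wait_seconds)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, status TEXT)"
    )
    conn.commit()
    return conn


def record_page(conn, url, status):
    # "with conn" commits on success and rolls back on error, which keeps
    # the write transaction (and so the lock) as short as possible.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, status) VALUES (?, ?)",
            (url, status),
        )


def count_pages(conn):
    return conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
```

Short transactions plus a generous timeout may be enough for a handful of worker processes; a client/server database remains the more robust choice under heavy write concurrency.
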
  69. # [00:59] <bewest> heh
  70. # [01:00] <othermaciej> zcorpan_: that's odd
  71. # [01:00] <othermaciej> zcorpan_: pointer?
  72. # [01:01] <zcorpan_> othermaciej: http://simon.html5.org/temp/html5lib-tests/wrapper.html
  73. # [01:01] <Hixie> webben: studying text contents is much harder for various reasons
  74. # [01:02] <webben> of course it's harder
  75. # [01:02] <webben> but given we're talking about what's basically a language for marking up text, such study is pretty critical
  76. # [01:03] <Hixie> be my guest :-)
  77. # [01:05] <othermaciej> zcorpan_: very confusing
  78. # [01:05] <othermaciej> zcorpan_: I'll try debugging it in a while - need to get coffee first
  79. # [01:05] <zcorpan_> othermaciej: ok
  80. # [01:06] <zcorpan_> man, i've really spent all day on this thing
  81. # [01:07] <Hixie> how does it feel to be paid to do this nonsense? :-)
  82. # [01:07] <jgraham> zcorpan_: You should now be able to commit to html5lib svn. If you're committing tests that html5lib doesn't pass, it's really good to email html5lib-discuss@googlegroups.com so people know there hasn't been a regression
  83. # [01:08] <zcorpan_> Hixie: feels great :)
  84. # [01:08] <zcorpan_> jgraham: ok. thanks
  85. # [01:09] <Hixie> hey i guess working for opera also means you get w3c member access
  86. # [01:09] <zcorpan_> yeah
  87. # [01:09] <Hixie> now you can see the craziness you've previously only been able to imagine
  88. # [01:10] <jgraham> zcorpan_: I think you need to join the html5lib-discuss group to post to it btw.
  89. # [01:10] <Philip`> Are you being paid to work on this at 1am? :-)
  90. # [01:10] <zcorpan_> Philip`: yep :)
  91. # [01:10] <zcorpan_> Philip`: plus, i work from home
  92. # [01:10] <zcorpan_> my work day starts when i want and ends when i want
  93. # [01:11] <Dashiva> h4x
  94. # [01:11] <zcorpan_> which is usually when i wake up and when i go to bed, respectively
  95. # [01:11] * othermaciej is now known as om_coffee
  96. # [01:11] <Dashiva> We have core time in Oslo
  97. # [01:13] <zcorpan_> Hixie: i read the pointers in http://ln.hixie.ch/?start=1172653243&count=1 but i haven't looked at other craziness
  98. # [01:13] <Hixie> btw i'm going to be in oslo (though extremely tired) late next monday and early next tuesday
  99. # [01:13] <Hixie> i'll probably pop by the opera offices
  100. # [01:14] * zcorpan_ wonders if anyone will pop by the eskilstuna office
  101. # [01:15] <Dashiva> Just as I take two days off. I'm going to miss the munchkin playing, no doubt.
  102. # [01:19] <zcorpan_> anything interesting on public-html the past 24h?
  103. # [01:20] * Quits: billmason (n=billmaso@ip156.unival.com) (Read error: 104 (Connection reset by peer))
  104. # [01:20] * Quits: tndH (i=Rob@adsl-87-102-93-12.karoo.KCOM.COM) ("ChatZilla 0.9.78.1-rdmsoft [XULRunner 1.8.0.9/2006120508]")
  105. # [01:22] <Hixie> i just found this interesting tidbit:
  106. # [01:22] <Hixie> Tantek Çelik (Microsoft): We are in the XHTML WG. I am the representative; recently it has become clear that the priorities of the XHTML WG are different from our priorities. We would like to see the HTML 4 and XHTML 1.x versions resolved. Most of the folks in the WG are XHTML 2 and that is not a priority for us.
  107. # [01:22] <Hixie> from http://www.w3.org/2004/04/webapps-cdf-ws/minutes-20040601.html
  108. # [01:22] <Hixie> Steven Pemberton (W3C/CWI): If you want that done, you have to do it.
  109. # [01:23] * Quits: kingryan (n=kingryan@corp.technorati.com) (Remote closed the connection)
  110. # [01:23] * Joins: h3h (n=w3rd@66-162-32-234.static.twtelecom.net)
  111. # [01:25] <tantek> Thanks for the memory Hixie :)
  112. # [01:25] <tantek> yes, that workshop is where everything "blew up" as the kids say
  113. # [01:25] <Hixie> indeed
  114. # [01:26] <Hixie> but i didn't realise that steven actually told us to go do html5
  115. # [01:26] <tantek> he didn't
  116. # [01:26] <tantek> he told you to go do html5, and me to go do microformats
  117. # [01:26] <tantek> he just didn't realize he did ;)
  118. # [01:26] <tantek> and yes, you're welcome for the setup :)
  119. # [01:27] <Hixie> :-)
  120. # [01:28] <tantek> out of that workshop i was more convinced than ever that I had to leave microsoft and pursue microformats wherever there was support for them, knowing that you would have a pretty good handle on the HTML 4.x XHTML 1.x updates.
  121. # [01:32] <tantek> Hixie, it wouldn't be inaccurate for you to even state that Microsoft's representative to that workshop called for work on HTML4 and XHTML1 along a set of requirements remarkably similar to those adopted by WHATWG.
  122. # [01:32] <Hixie> indeed
  123. # [01:32] <tantek> thereby confirming all the conspiracy theorists suspicions that WHATWG is merely doing Microsoft's bidding. ;)
  124. # [01:33] * Quits: weinig (i=weinig@nat/apple/x-a260b6922c3b12a6) (Read error: 104 (Connection reset by peer))
  125. # [01:33] <Hixie> oh the modern conspiracy theory is that it's google's attempt at getting around the problem that converting adsense to xhtml2 would be too hard
  126. # [01:33] <zcorpan_> LOL
  127. # [01:36] * Joins: weinig (i=weinig@nat/apple/x-3021b5e01346d7af)
  128. # [01:41] * Quits: hendry (n=hendry@91.84.62.62) ("nn")
  129. # [01:50] * om_coffee is now known as othermaciej
  130. # [01:52] * Quits: h3h (n=w3rd@66-162-32-234.static.twtelecom.net)
  131. # [02:05] * Joins: epeus (i=KevinMar@conference/plone/docsprint/x-ea4c9cc997546964)
  132. # [02:08] * Joins: h3h (n=w3rd@66-162-32-234.static.twtelecom.net)
  133. # [02:08] * Quits: KevinMarks (i=KevinMar@nat/google/x-ea66512c0f090208) (Nick collision from services.)
  134. # [02:08] * epeus is now known as KevinMarks
  135. # [02:10] * Joins: kingryan (n=kingryan@dsl081-240-149.sfo1.dsl.speakeasy.net)
  136. # [02:24] * Joins: weinig_ (i=weinig@nat/apple/x-1d2c33c52f79e762)
  137. # [02:24] * Joins: epeus (i=KevinMar@nat/google/x-55d456545ad17e99)
  138. # [02:25] * Quits: syp| (n=syp@lasigpc9.epfl.ch) (kubrick.freenode.net irc.freenode.net)
  139. # [02:25] * Quits: fuzzy76 (i=fuzzy76@matilda.td.org.uit.no) (kubrick.freenode.net irc.freenode.net)
  140. # [02:25] * Joins: syp| (n=syp@lasigpc9.epfl.ch)
  141. # [02:25] * Joins: fuzzy76 (i=fuzzy76@matilda.td.org.uit.no)
  142. # [02:25] * Quits: weinig (i=weinig@nat/apple/x-3021b5e01346d7af) (Read error: 104 (Connection reset by peer))
  143. # [02:27] * Quits: KevinMarks (i=KevinMar@conference/plone/docsprint/x-ea4c9cc997546964) (Nick collision from services.)
  144. # [02:27] * epeus is now known as KevinMarks
  145. # [02:30] * Quits: KevinMarks (i=KevinMar@nat/google/x-55d456545ad17e99) ("The computer fell asleep")
  146. # [02:31] <webben> Hixie: more vaguely sane long descriptions: http://www.tsu.ox.ac.uk/info/report.php
  147. # [02:32] <webben> (although I think they could have made use of data tables)
  148. # [02:33] <webben> another example: http://docs.sun.com/source/817-5763/
  149. # [02:34] <webben> in general, look through this search: http://www.google.co.uk/search?hl=en&q=%22long+description+for%22 for lots of longdesc examples
  150. # [02:36] <Hixie> my script uses the same source data as that search, basically
  151. # [02:39] * Quits: zcorpan_ (n=zcorpan@84-216-41-27.sprayadsl.telenor.se) (Read error: 110 (Connection timed out))
  152. # [02:46] * Philip` never knew that IE supports <comment>...</comment>
  153. # [02:47] <Philip`> (Interestingly the text appears to be not in the DOM, but is in the innerHTML view)
  154. # [02:55] * Quits: webben (n=benh@91.84.193.157)
  155. # [02:56] * Quits: jgraham (n=jgraham@81-86-214-45.dsl.pipex.com) (Read error: 110 (Connection timed out))
  156. # [03:02] * Joins: karlUshi (n=karl@dhcp-247-173.mag.keio.ac.jp)
  157. # [03:13] * Quits: aroben (n=adamrobe@17.203.15.248)
  158. # [03:16] * Quits: weinig_ (i=weinig@nat/apple/x-1d2c33c52f79e762)
  159. # [03:28] * Quits: othermaciej (n=mjs@dsl081-048-145.sfo1.dsl.speakeasy.net)
  160. # [03:39] * Joins: othermaciej (n=mjs@dsl081-048-145.sfo1.dsl.speakeasy.net)
  161. # [03:45] * Joins: yod (n=ot@dhcp-247-181.mag.keio.ac.jp)
  162. # [03:52] * Joins: KevinMarks (n=KevinMar@c-76-102-254-252.hsd1.ca.comcast.net)
  163. # [03:58] * Joins: weinig (i=weinig@nat/apple/x-4db5afe5bef23360)
  164. # [04:07] * Joins: kfish (n=conrad@61.194.21.25)
  165. # [04:11] * Quits: tantek (n=tantek@corp.technorati.com)
  166. # [04:14] * Quits: h3h (n=w3rd@66-162-32-234.static.twtelecom.net) ("|")
  167. # [04:17] <Hixie> heh, i just noticed something about the press release the w3c put out when the charters were announced
  168. # [04:18] <othermaciej> yeah?
  169. # [04:18] <Hixie> it says:
  170. # [04:18] <Hixie> "With the chartering of the XHTML 2 Working Group, W3C will continue its technical work on the language at the same time it considers rebranding the technology to clarify its independence and value in the marketplace."
  171. # [04:19] <othermaciej> hah!
  172. # [04:20] * Quits: bzed (n=bzed@dslb-084-059-121-172.pools.arcor-ip.net) ("Leaving")
  173. # [04:20] <othermaciej> "dear xhtml2 wg, how is that rebranding coming along? love, the html wg"
  174. # [04:22] * Joins: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
  175. # [04:28] * Quits: MikeSmith (n=MikeSmit@eM60-254-215-75.pool.emobile.ad.jp) (Read error: 110 (Connection timed out))
  176. # [04:29] * Joins: MikeSmith (n=MikeSmit@eM60-254-213-154.pool.emobile.ad.jp)
  177. # [04:32] * Joins: Philip`_ (n=philip@zaynar.demon.co.uk)
  178. # [04:49] * Quits: Philip` (n=philip@zaynar.demon.co.uk) (Read error: 110 (Connection timed out))
  179. # [05:07] * Quits: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
  180. # [05:39] * Quits: Yudai (n=Yudai@p931010.tokyte00.ap.so-net.ne.jp) (Read error: 110 (Connection timed out))
  181. # [05:39] * Joins: Yudai (n=Yudai@pae3703.tokyte00.ap.so-net.ne.jp)
  182. # [05:44] * Quits: Lachy (n=Lachy@124-168-24-114.dyn.iinet.net.au) ("ChatZilla 0.9.78.1 [Firefox 2.0.0.4/2007051502]")
  183. # [05:53] * Joins: Lachy (n=Lachy@124-168-24-114.dyn.iinet.net.au)
  184. # [06:07] * Quits: kingryan (n=kingryan@dsl081-240-149.sfo1.dsl.speakeasy.net)
  185. # [06:17] * Joins: tantek (n=tantek@adsl-63-195-114-133.dsl.snfc21.pacbell.net)
  186. # [06:24] * Quits: weinig (i=weinig@nat/apple/x-4db5afe5bef23360)
  187. # [06:35] * Quits: tantek (n=tantek@adsl-63-195-114-133.dsl.snfc21.pacbell.net)
  188. # [06:45] <hsivonen> annevk: I meant that when you've got a form control whose form pointer does not point to an ancestor and that doesn't have a form='' attribute pointing to the same node as the form pointer, generate an id attribute on the node pointed by the form pointer if there isn't an id already and generate a corresponding form='' attribute on the form control
  189. # [06:45] <hsivonen> annevk: this fails if the <form> element already has an id='' attribute and the value of that attribute is a duplicate
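
Editorial aside: the repair step hsivonen outlines can be sketched roughly as follows. This is an illustration under stated assumptions, not his implementation. The `Node` class, the `form_owner` field standing in for the parser's form pointer, and the `generated-form-N` id scheme are all hypothetical.

```python
import itertools


class Node:
    def __init__(self, name, attrs=None, parent=None):
        self.name = name
        self.attrs = dict(attrs or {})
        self.parent = parent
        self.form_owner = None  # stand-in for the parser's form pointer


def is_ancestor(candidate, node):
    current = node.parent
    while current is not None:
        if current is candidate:
            return True
        current = current.parent
    return False


def repair_form_association(control, used_ids, counter):
    form = control.form_owner
    if form is None or is_ancestor(form, control):
        return  # no owner, or nesting already expresses the association
    form_id = form.attrs.get("id")
    if form_id is not None and control.attrs.get("form") == form_id:
        return  # an explicit form attribute already names the owner
    if form_id is None:
        # Generate an id that no other element in the document uses.
        for n in counter:
            form_id = f"generated-form-{n}"
            if form_id not in used_ids:
                break
        form.attrs["id"] = form_id
        used_ids.add(form_id)
    control.attrs["form"] = form_id
```

As the next message points out, the sketch shares the weakness of the approach: if the form's existing id duplicates another element's id, the generated `form` attribute may resolve to the wrong element.
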
  190. # [06:51] * Joins: weinig (n=weinig@c-67-188-89-242.hsd1.ca.comcast.net)
  191. # [06:51] * Quits: jcgregorio (n=chatzill@adsl-072-148-043-048.sip.rmo.bellsouth.net) ("ChatZilla 0.9.78.1 [Firefox 2.0.0.4/2007060115]")
  192. # [06:57] <hsivonen> othermaciej: Also I suggested the iterative DOM traversal algorithm to zcorpan, but does IE guarantee that the algorithm terminates? I think it doesn't.
  193. # [06:58] * Joins: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
  194. # [06:59] * Quits: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net) (Remote closed the connection)
  195. # [06:59] * Joins: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
  196. # [07:01] <othermaciej> hsivonen: oh - good point, I'm not sure how it works in the face of a non-tree
  197. # [07:01] <othermaciej> hsivonen: I'm not sure what exactly IE's non-tree DOMs look like
  198. # [07:03] * Quits: weinig (n=weinig@c-67-188-89-242.hsd1.ca.comcast.net)
  199. # [07:03] <hsivonen> othermaciej: this is one significant reason why a non-tree DOM sucks
  200. # [07:04] * Joins: weinig (n=weinig@c-67-188-89-242.hsd1.ca.comcast.net)
  201. # [07:06] <othermaciej> hsivonen: I have seen a look of shocked realization on the faces of JS library authors when they heard that IE can do that
  202. # [07:07] <othermaciej> "that explains those weird infinite loop bugs!"
  203. # [07:07] <othermaciej> do you actually know what it does though?
  204. # [07:07] <othermaciej> is it just the parent pointer that can be wrong? you could work around that with a stack
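
Editorial aside: a minimal sketch of the stack-based workaround othermaciej suggests, with an added visited set so the walk terminates even if the structure contains a cycle. The `Node` class with a plain `children` list is a hypothetical stand-in and makes no claim about what IE's DOM actually exposes.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.children = []


def walk(root):
    """Visit every reachable node exactly once, never following parent pointers."""
    seen = set()
    stack = [root]
    order = []
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue  # reached again through a shared child or a cycle
        seen.add(id(node))
        order.append(node.name)
        # Push children in reverse so the first child is popped first.
        stack.extend(reversed(node.children))
    return order
```

On a well-formed tree this yields ordinary document order. On a malformed graph it still terminates, at the cost of reporting each node only at its first appearance.
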
  205. # [07:10] <Hixie> see my blog
  206. # [07:10] <Hixie> entries starting with "Tag Soup" iirc
  207. # [07:10] <Hixie> bbl
  208. # [07:11] * Quits: csarven (n=nevrasc@modemcable081.152-201-24.mc.videotron.ca) ("http:/www.csarven.ca")
  209. # [07:14] * Quits: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
  210. # [07:16] * Joins: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
  211. # [07:17] <hsivonen> othermaciej: not sure. The edges between EM and ADDRESS in the Mac IE 5 DOM with Hixie's case look like the ingredients of an infinite loop: http://hsivonen.iki.fi/soup-dom/ (I can't test IE6 here.)
  212. # [07:22] <othermaciej> good lord, that's insane
  213. # [07:22] * othermaciej blames tantek
  214. # [07:23] <othermaciej> child pointer indicates presence in the childNodes array?
  215. # [07:24] <hsivonen> Philip`_: If you'd like to run surveys with something that runs as native instructions at run time, I suggest figuring out which Java spider framework can easily take a plugged HTML5 parser
  216. # [07:25] <othermaciej> hsivonen: it looks like traversal via firstChild/nextSibling/parentNode would not infinite loop on that, but it would miss some elements
  217. # [07:25] * Quits: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
  218. # [07:25] <othermaciej> wait, maybe it wouldn't even miss anything
  219. # [07:25] <hsivonen> Philip`_: the parser needs to get a java.io.InputStream, the value of the HTTP charset (null if absent), a SAX ErrorHandler and a SAX ContentHandler (for extracting links)
  220. # [07:25] <hsivonen> othermaciej: child is firstchild
  221. # [07:26] <hsivonen> othermaciej: IIRC
  222. # [07:26] <othermaciej> it can't be only firstChild, since you can't have multiple firstChilds
  223. # [07:26] <hsivonen> othermaciej: oh. right. can't remember anymore what I did
  224. # [07:28] <othermaciej> some nodes would be visited more than once I guess, w/ tree-based traversal
  225. # [07:29] <othermaciej> we have some ex-MacIE folks on our team, I could ask them what they were thinking :-)
  226. # [07:29] <hsivonen> Philip`_: the Internet Archive spider looks promising, but they seem to rely on the JVM running on Linux with a particular thread impl
  227. # [07:30] <hsivonen> Philip`_: btw, I wouldn't run a Java spider that used java.net.URLConnection without socket timeouts
  228. # [07:30] <hsivonen> I have more confidence in Commons HTTP Client
  229. # [07:31] <hsivonen> I haven't checked which HTTP client the Internet Archive spider uses
  230. # [08:10] <Hixie> hm, xmlns="...xhtml" usage has gone up to 20% according to the survey i just did (of several billion html docs)
  231. # [08:11] <Hixie> from about 15% about a year ago
  232. # [08:15] <Hixie> and 41% have no DOCTYPE, down from about 50% at the same time iirc
  233. # [08:16] <Hixie> 19% have the XHTML1 DOCTYPE, 11% have a 4.01 Transitional DOCTYPE with no URI
  234. # [08:17] <Hixie> 6% are 4.01 Transitional with URI
  235. # [08:19] * Joins: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
  236. # [08:24] * Quits: kfish (n=conrad@61.194.21.25) (Remote closed the connection)
  237. # [08:25] * Joins: kfish (n=conrad@61.194.21.25)
  238. # [08:36] <Hixie> and the 0.014% of XHTML usage has gone up to 0.062%
  239. # [08:37] <hsivonen> Hixie: real XHTML? as in a/x+x
  240. # [08:38] <hsivonen> Amazon EC2 was mentioned earlier. any actual experience with using it?
  241. # [08:47] * othermaciej is surprised to hear there's that many sites that give the finger to IE; or is that conditionally served?
  242. # [08:50] <Hixie> hsivonen: yeah
  243. # [08:50] <Hixie> othermaciej: might be conditional, dunno
  244. # [08:51] <hsivonen> Hixie: does Google unify multiple representations of a page if it finds foo with Content-Location, foo.html and foo.xhtml?
  245. # [08:54] <Hixie> duplicate elimination happens before my script gets hold of the data, yes, but i don't know exactly what gets counted as a dupe
  246. # [08:55] * Joins: peepo (n=Jay@86.157.113.34)
  247. # [08:56] <hsivonen> hmm. looks like Google has changed its behavior again and now http://hsivonen.iki.fi/thesis/html5-conformance-checker over .html or .xhtml. IIRC, it returned http://hsivonen.iki.fi/thesis/html5-conformance-checker.xhtml a couple of weeks ago
  248. # [08:58] <hsivonen> s/now/now prefers/
  249. # [08:59] <Hixie> it probably treats them separately and picks one based on which has the most "relevance"
  250. # [09:05] * Joins: Charl (n=charlvn@c1-228-9.wblv.isadsl.co.za)
  251. # [09:10] * Joins: tndH (i=Rob@adsl-87-102-93-12.karoo.KCOM.COM)
  252. # [09:32] * Joins: BenWard (i=BenWard@nat/yahoo/x-36d10ff5536839e6)
  253. # [09:32] * Quits: karlUshi (n=karl@dhcp-247-173.mag.keio.ac.jp) ("Where dwelt Ymir, or wherein did he find sustenance?")
  254. # [09:32] * Quits: yod (n=ot@dhcp-247-181.mag.keio.ac.jp) ("This computer has gone to sleep")
  255. # [09:59] * Joins: zcorpan_ (n=zcorpan@84-216-43-119.sprayadsl.telenor.se)
  256. # [09:59] * Joins: jgraham (n=jgraham@81-86-222-233.dsl.pipex.com)
  257. # [10:15] * Joins: the_mart (n=Martin@host86-135-9-158.range86-135.btcentralplus.com)
  258. # [10:17] * Quits: peepo (n=Jay@86.157.113.34) ("later")
  259. # [10:21] * Quits: weinig (n=weinig@c-67-188-89-242.hsd1.ca.comcast.net)
  260. # [10:24] <hsivonen> http://www.w3.org/mid/886507.69879.qm@web50802.mail.re2.yahoo.com
  261. # [10:26] * Joins: hendry (n=hendry@91.84.62.62)
  262. # [10:27] <annevk> http://lists.w3.org/Archives/Public/www-validator/2007Jul/0011.html
  263. # [10:27] <zcorpan_> oh of course. writing your own dtd makes you validate.
  264. # [10:28] <annevk> it's true
  265. # [10:28] <annevk> it's just not very smart
  266. # [10:28] * Quits: kfish (n=conrad@61.194.21.25) ("RW")
  267. # [10:28] * Joins: billyjack (n=MikeSmit@eM60-254-242-228.pool.emobile.ad.jp)
  268. # [10:29] <zcorpan_> might be if you really use validation as qa check, and you don't want to flag files that have 1 error you already know about and have to have around
  269. # [10:30] * Quits: MikeSmith (n=MikeSmit@eM60-254-213-154.pool.emobile.ad.jp) (Read error: 110 (Connection timed out))
  270. # [10:31] * Joins: webben (i=benh@nat/yahoo/x-c93aa498557bcb6c)
  271. # [10:42] * Quits: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
  272. # [10:45] * Joins: ROBOd (n=robod@86.34.246.154)
  273. # [10:45] * Quits: webben (i=benh@nat/yahoo/x-c93aa498557bcb6c)
  274. # [10:51] * Joins: webben (i=benh@nat/yahoo/x-7630519bda45a319)
  275. # [11:04] <Lachy> Hixie, yt?
  276. # [11:07] <annevk> zcorpan_, http://simon.html5.org/temp/html5lib-tests/dom2string.js doesn't seem to handle attributes
  277. # [11:08] <zcorpan_> annevk: oops
  278. # [11:09] * Quits: annevk (n=annevk@pat-tdc.opera.com) (Remote closed the connection)
  279. # [11:10] * Joins: annevk (n=annevk@pat-tdc.opera.com)
  280. # [11:13] <zcorpan_> annevk: fixed
  281. # [11:18] <Hixie> Lachy: yo
  282. # [11:19] <Lachy> Hey Hixie, Marcos and I are working on the XBL Primer, and we're trying to come up with a concise description of what a template is. Any suggestions?
  283. # [11:20] <Hixie> it's some markup that will be used to render the bound element, i guess
  284. # [11:20] <Lachy> so far we have "A template is used to control the presentation of a document", but we want to say something about how it reorders content in the DOM, without altering it, using shadow trees, but without using technical terms
  285. # [11:20] <annevk> interesting, Opera returns uppercase attribute names
  286. # [11:21] <zcorpan_> annevk: yeah.
  287. # [11:21] <Hixie> Lachy: good luck
  288. # [11:21] <Lachy> thanks
  289. # [11:21] <Hixie> Lachy: my best attempt is what's in the spec
  290. # [11:21] <Hixie> Lachy: in the note in the definition of <template>
  291. # [11:22] <annevk> "A template defines the building blocks for the subtree of the bounding element."
  292. # [11:22] <Lachy> yeah, that's the problem :-)
  293. # [11:23] <Lachy> hmm. we could try and work something like that into it.
  294. # [11:24] <annevk> just say something and then illustrate it with some "easy" to grasp examples
  295. # [11:24] <Lachy> yeah, that's the idea
  296. # [11:27] <zcorpan_> hm. opera can have cdata nodes in the dom. how should i output those?
  297. # [11:27] <zcorpan_> "<![CDATA[ " + current.nodeValue + " ]]>" ?
  298. # [11:29] <annevk> yeah
  299. # [11:32] <zcorpan_> done
  300. # [11:38] <Hixie> i'm instrumenting my html parser to report how many times it clones nodes in the AAA and inline-reconstruction algorithms
  301. # [11:38] <Hixie> anything else i can instrument while i'm at it?
  302. # [11:39] <Hixie> hsivonen? annevk? jgraham?
  303. # [11:40] <annevk> we have some XXX comments about tokenization...
  304. # [11:41] <annevk> specifically which cases in states are the most frequent
  305. # [11:41] <annevk> so you can optimize those cases in some way...
  306. # [11:42] <annevk> other interesting things might be <form> nodes <form> where nodes does not include </form> and then do some browser testing on those more complicated examples from real world pages
  307. # [11:44] <Hixie> eh?
  308. # [11:45] <Hixie> i could emit for each tokeniser state the most common tokens seen, i guess
  309. # [11:46] <Hixie> it would make the parser way slower, but it could work
  310. # [11:46] <annevk> it's probably not very important
  311. # [11:46] * Joins: maikmerten (n=maikmert@T6eaf.t.pppool.de)
  312. # [11:46] <annevk> tree mutation and node duplication are more interesting
  313. # [11:47] <annevk> would be fun to count how often you encounter <canvas> nowadays :)
  314. # [11:49] <Hixie> i've looked at elements in a separate study
  315. # [11:50] <Hixie> canvas didn't appear in the top 200
  316. # [11:51] * zcorpan_ suspects that some <canvas>es are only output with script
  317. # [12:00] <annevk> k
  318. # [12:00] <zcorpan_> hmm. dom core doesn't specify an order for .attributes ... i need to sort them myself
  319. # [12:01] <annevk> I wonder if we have actually sorted them...
  320. # [12:03] <zcorpan_> opera and safari don't seem to sort them. ie seems to sort them alphabetically. firefox alphabetically reversed.
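
Editorial aside: since attribute order differs between browsers, a test serializer has to impose its own. A minimal sketch, assuming attributes arrive as `(name, value)` pairs; `format_attributes` is a hypothetical helper, not code from dom2string.js. Lowercasing the names also evens out the uppercase names annevk saw in Opera.

```python
def format_attributes(attributes):
    """Return one 'name="value"' string per attribute, sorted by lowercased name."""
    lines = []
    for name, value in sorted(attributes, key=lambda pair: pair[0].lower()):
        lines.append(f'{name.lower()}="{value}"')
    return lines
```

With a fixed order, the same DOM produces the same text in every browser, so expected output can be compared as plain strings.
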
  321. # [12:03] <Hixie> ok i'm going to emit a list of total count of all the tokens
  322. # [12:04] <Hixie> for each kind of token in each insertion mode
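
Editorial aside: the kind of tally Hixie describes could look roughly like this. It is only a sketch; `TokenStats` and the mode and token names passed to it are hypothetical, and nothing here reflects his actual parser.

```python
from collections import Counter


class TokenStats:
    def __init__(self):
        self.counts = Counter()

    def record(self, insertion_mode, token_kind):
        # One counter per (insertion mode, kind of token) pair.
        self.counts[(insertion_mode, token_kind)] += 1

    def report(self):
        # Most frequent pairs first; ties are broken alphabetically.
        return sorted(self.counts.items(), key=lambda item: (-item[1], item[0]))
```
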
  323. # [12:04] <Hixie> anything else?
  324. # [12:04] <Hixie> last chance before i set this off and go to bed...
  325. # [12:04] <annevk> ah, I actually meant characters I think
  326. # [12:04] <annevk> but that may be too expensive
  327. # [12:04] <Hixie> characters?
  328. # [12:04] <annevk> during tokenization
  329. # [12:04] <Hixie> how do you mean?
  330. # [12:05] <zcorpan_> see how often ">" (with quotes) appears in doctypes or bogus comments
  331. # [12:05] <annevk> so you can optimize a particular tokenization state
  332. # [12:05] <Hixie> oh i thought you wanted to optimise the tree constructor states
  333. # [12:06] <Hixie> zcorpan_: hm
  334. # [12:06] <hsivonen> Hixie: hmm. I guess there might be merit in instrumenting how often IN_BODY code runs with the actual insertion mode being one of the table modes other than caption and cell
  335. # [12:06] <Hixie> annevk: surely for the tokeniser it makes no difference since you'll just do table dispatch
  336. # [12:06] <annevk> IE has this nice <!- .... ">" more comment ... >
  337. # [12:07] <zcorpan_> Hixie: http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2007-June/012078.html
  338. # [12:07] <Hixie> hsivonen: you mean an average of times per page that the inbody state is invoked when the state is not inbody, incell, or incaption?
  339. # [12:07] <hsivonen> Hixie: is it even important to clone DOM nodes instead of using the attributes on the original token and creating a new DOM node using those?
  340. # [12:07] <Hixie> zcorpan_: yeah i'm just trying to work out how to do it
  341. # [12:07] <hsivonen> that is, do you really want to clone concurrent attribute changes?
  342. # [12:08] <Hixie> i don't think the dom supports having attributes shared between nodes
  343. # [12:09] <hsivonen> Hixie: yes, the average times the table states actually fall though to in body
  344. # [12:09] <hsivonen> through
  345. # [12:12] <Hixie> ok, i'm logging the actual insertion mode when my inhead, inbody, and intable functions are invoked
  346. # [12:12] <hsivonen> Hixie: since that only happens in non-conforming cases and Java doesn't have goto, I let the code hit some useless branches when the fall-through happens
  347. # [12:12] <Hixie> hopefully they map exactly to the spec
  348. # [12:14] <Hixie> zcorpan_: for DOCTYPEs we don't care, right? since what the spec does matches IE anyway?
  349. # [12:14] <hsivonen> (A smart compiler could fix this, but I doubt javac or hotspot are that smart)
  350. # [12:14] <annevk> yeah, DOCTYPEs match IE
  351. # [12:14] <annevk> it's just that IE uses the same mode for bogus comments as they use for DOCTYPEs it seems
  352. # [12:15] <Hixie> i'm gonna bail on working out what characters are most common in each tokeniser mode, on the principle that there are so few states it hardly matters anyway
  353. # [12:15] <zcorpan_> Hixie: not quite. the spec doesn't handle <!doctype ">" >
  354. # [12:15] <annevk> oops
  355. # [12:15] <zcorpan_> Hixie: the spec only matches ie if the > is in an actual FPI or SPI
  356. # [12:16] <hsivonen> Hixie: oh yeah, one more thing for optimization: whether an average stack node is tested for being in a group of element names more than once
  357. # [12:17] <Hixie> well i didn't find any DOCTYPEs with > in their name part, at least not enough to appear on my radar in the scan of doctypes i did earlier this week
  358. # [12:17] <hsivonen> Hixie: that is, whether it makes sense to have a boolean on a stack node that says for example whether the node is a table context sentinel
  359. # [12:17] <zcorpan_> Hixie: ok
  360. # [12:17] <zcorpan_> Hixie: isn't that because > in the name part terminates the doctype? :)
  361. # [12:18] <hsivonen> Hixie: or whether a stack node should have a flag for phrasing OR formatting OR div OR address
  362. # [12:18] <Hixie> sorry, i meant "
  363. # [12:18] <zcorpan_> ah
  364. # [12:18] <zcorpan_> ok
  365. # [12:18] <Hixie> hsivonen: so what i did with that is that each well-known tag name has an integer associated with it (like an atom) and for each special feature that the parser cares about i used a bit
  366. # [12:19] <Hixie> i used 24 bits for these flags
  367. # [12:20] <Hixie> so for example all the <hx> elements have the number 0x400008400000
  368. # [12:20] <hsivonen> Hixie: my strategy is to intern well-known names so that testing against one name is a comparison of memory addresses but still testing if a name is in a group means as many comparisons as names in group
  369. # [12:20] <Hixie> the leading 0x4 is "element" (as opposed to text node), the 8 is "hx node", and the 4 is "closes <p> elements"
  370. # [12:21] <Hixie> yeah so my parser never compares tag names once they're in the stack
  371. # [12:21] <Hixie> doing string compares was prohibitively expensive
  372. # [12:21] <hsivonen> interesting
  373. # [12:21] <Hixie> i just use the integer that says whether a node is a text node, comment node, doctype, etc, to say what special kind of element it is too
  374. # [12:22] <Hixie> and so everything is always exactly one & and exactly one ==
  375. # [12:22] * Joins: Ducki (n=Alex@dialin-145-254-186-173.pools.arcor-ip.net)
  376. # [12:23] <annevk> and you construct those numbers during tokenization?
  377. # [12:23] <hsivonen> I guess I'll complete the tree builder with my current approach and will leave a tokenizer-assigned bitfield as a later interface-breaking optimization
  378. # [12:24] <Hixie> annevk: whenever i create a node, i create it with the appropriate constant
  379. # [12:24] <Hixie> the tokeniser doesn't know about these
  380. # [12:24] <Hixie> it emits tokens with tag names
  381. # [12:24] <Hixie> it's only when i create nodes that i use these
  382. # [12:24] <hsivonen> Hixie: ooh. so "closes p" is not assigned in the tokenizer after all
  383. # [12:24] <annevk> ok, so the tree construction stage does use string comparison?
  384. # [12:25] <Hixie> yeah, tokens are string-compared
  385. # [12:25] <Hixie> but i think my compiler might be atomising them
  386. # [12:25] <Hixie> so it's not such a big deal
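
Editorial sketch of the flag scheme Hixie describes above, under loud assumptions: the bit values, the `NODE_FLAGS` table and the helper names are hypothetical and are not the constants from his parser. The point is only that the flags are assigned once, when the tree builder creates a node from a tag-name token, and that every later test on a stack node is a single `&` and a single `==`.

```python
# Hypothetical bit assignments, chosen only for illustration.
ELEMENT = 0x1      # node is an element (not text, comment or doctype)
HEADING = 0x2      # node is one of h1..h6
CLOSES_P = 0x4     # its start tag closes an open <p>
FORMATTING = 0x8   # node is a formatting element such as <b>

# Looked up once, at node creation time, from the token's tag name.
NODE_FLAGS = {
    "div": ELEMENT | CLOSES_P,
    "b": ELEMENT | FORMATTING,
}
for level in range(1, 7):
    NODE_FLAGS["h%d" % level] = ELEMENT | HEADING | CLOSES_P


def flags_for(tag_name):
    """Return the flags for a new node; unknown names are plain elements."""
    return NODE_FLAGS.get(tag_name, ELEMENT)


def has_all(flags, mask):
    """Test a group membership with exactly one & and one ==."""
    return (flags & mask) == mask
```

With this layout, asking whether a stack node is any heading is `has_all(node_flags, HEADING)`, with no per-name string comparison.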
  387. # [12:27] <hsivonen> I'm currently using the generic String.intern(), but I figured how to make a fast interning function with knowledge about the possible names (three-level switch: length, last char, second to last char)
  388. # [12:27] <hsivonen> but typing that is too much work
  389. # [12:27] <hsivonen> so I guess I'll write a small Python program that generates Java code for the interning function at some point
  390. # [12:28] <Hixie> zcorpan_: given that only IE does this, I'm going to assume it's not a big deal. I can investigate it in more detail later maybe. Don't want to hack the parser too much tonight. :-)
  391. # [12:28] <Hixie> beware that the names are unbounded
  392. # [12:28] <Hixie> <fiv> is an element name that is seen in the wild, e.g.
  393. # [12:28] <Hixie> you don't want to treat it as <div>
  394. # [12:29] <Hixie> especially in your case :-)
  395. # [12:30] <hsivonen> Hixie: of if the length is > 2, the prefix needs to be compared, too, to make sure
  396. # [12:30] <hsivonen> Hixie: still better than an intermediate copy to java.lang.String
  397. # [12:31] <hsivonen> Hixie: the idea is to weed out all but one prefix candidate
  398. # [12:31] <Hixie> ah cool
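
Editorial sketch of the interning idea hsivonen outlines above: dispatch on length, last character and second-to-last character to narrow the well-known names down to at most one candidate, then compare the remaining prefix so that `<fiv>` is never mistaken for `<div>`. It is written in Python rather than generated Java, and the `WELL_KNOWN` list, `CANDIDATES` table and `intern_name` function are hypothetical names with a deliberately tiny name set.

```python
# A tiny, hypothetical set of well-known names; a real table would be larger.
WELL_KNOWN = ("a", "b", "p", "td", "tr", "div", "span", "table")


def _key(buf, length):
    """Build the (length, last char, second-to-last char) dispatch key."""
    last = buf[length - 1]
    second_last = buf[length - 2] if length > 1 else ""
    return (length, last, second_last)


CANDIDATES = {_key(name, len(name)): name for name in WELL_KNOWN}
# The scheme only works if each key identifies a single candidate.
assert len(CANDIDATES) == len(WELL_KNOWN)


def intern_name(buf, length):
    """Return the canonical name for buf[:length], or None if not well known.

    buf is a character buffer (list or str), so no intermediate string
    has to be built just to perform the lookup.
    """
    if length == 0:
        return None
    candidate = CANDIDATES.get(_key(buf, length))
    if candidate is None:
        return None
    # The key already covers the last two characters; check the prefix.
    for i in range(length - 2):
        if buf[i] != candidate[i]:
            return None
    return candidate
```

Because the function always returns the same object for a given well-known name, later comparisons against one name can be identity checks.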
  399. # [12:33] * Joins: Ducki_ (n=Alex@dialin-145-254-189-168.pools.arcor-ip.net)
  400. # [12:36] <Hixie> right sleep time
  401. # [12:36] <Hixie> nn
  402. # [12:36] <hsivonen> nn
  403. # [12:37] * Quits: Ducki (n=Alex@dialin-145-254-186-173.pools.arcor-ip.net) (Read error: 113 (No route to host))
  404. # [12:41] * Quits: Lachy (n=Lachy@124-168-24-114.dyn.iinet.net.au) (Read error: 104 (Connection reset by peer))
  405. # [12:56] * Joins: zcorpan (n=zcorpan@84-216-43-119.sprayadsl.telenor.se)
  406. # [13:03] * Quits: zcorpan_ (n=zcorpan@84-216-43-119.sprayadsl.telenor.se) (Read error: 110 (Connection timed out))
  407. # [13:20] * Quits: webben (i=benh@nat/yahoo/x-7630519bda45a319)
  408. # [13:23] * Joins: webben (i=benh@nat/yahoo/x-a060493131c95b1e)
  409. # [13:26] <zcorpan> the parser test format doesn't distinguish between an "" attribute and a text node "=" (e.g.: <p "">"="</p>)
  410. # [13:26] <zcorpan> | <p>
  411. # [13:26] <zcorpan> | ""=""
  412. # [13:26] <zcorpan> | ""=""
  413. # [13:26] * Quits: webben (i=benh@nat/yahoo/x-a060493131c95b1e) (Client Quit)
  414. # [13:27] <annevk> that's not too relevant though
  415. # [13:27] <annevk> but an interesting edge case
  416. # [13:28] <zcorpan> perhaps " in text nodes should be escaped with \?
  417. # [13:28] <annevk> why?
  418. # [13:29] <zcorpan> so you can tell the difference between attributes and text nodes. but perhaps it doesn't matter
  419. # [13:30] <annevk> just don't mix them
  420. # [13:32] <annevk> also, if you make mistakes in your parser at that level you've got bigger issues :)
  421. # [13:33] <zcorpan> which parser?
  422. # [13:33] <annevk> HTML parser?
  423. # [13:33] <zcorpan> ah. yeah.
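
Editorial sketch of the ambiguity zcorpan points out above. It assumes the tree dump prints attributes as `name="value"` and text nodes as `"data"`; the function names are hypothetical and this is not the html5lib serializer. The last function shows the backslash-escaping variant he floats.

```python
def dump_attribute(name, value):
    """Print an attribute the way the tree dump is assumed to."""
    return '%s="%s"' % (name, value)


def dump_text(data):
    """Print a text node the way the tree dump is assumed to."""
    return '"%s"' % data


def dump_text_escaped(data):
    """Variant that escapes backslashes and quotes inside text nodes."""
    escaped = data.replace("\\", "\\\\").replace('"', '\\"')
    return '"%s"' % escaped


# Attribute whose name is "" (two quote characters) with an empty value.
attr_line = dump_attribute('""', "")
# Text node whose data is "=" (quote, equals sign, quote).
text_line = dump_text('"="')
# Both lines come out as the same five characters, so the dump is ambiguous.
print(attr_line == text_line)
```

With the escaping variant the text node prints differently from the attribute, which is the distinction zcorpan was after.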
  424. # [13:37] * Quits: ROBOd (n=robod@86.34.246.154) ("http://www.robodesign.ro")
  425. # [13:38] * Joins: mw22 (n=chatzill@h8441169151.dsl.speedlinq.nl)
  426. # [13:41] * Parts: mw22 (n=chatzill@h8441169151.dsl.speedlinq.nl)
  427. # [13:42] <Philip`_> hsivonen: I think it might be reasonable to keep the spidering and parsing completely separate, so they could be different languages (depending on what useful tools are available for), just communicating asynchronously through some database (which is probably necessary anyway to support parallelism)
  428. # [13:44] * Joins: ROBOd (n=robod@86.34.246.154)
  429. # [13:55] <hsivonen> Philip`_: I've never done wide-scale spidering. however, I would think that sticking stuff in a database in between would slow things significantly compared to the parser reading from the real socked when the spidering happens (possible with e.g. Commons HttpClient)
  430. # [13:57] <hsivonen> to me, it seems that the obvious way to implement this is to have a number of worker threads that run both the parser and the HTTP client and request URLs and report results to a centralized thread-safe coordination object
  431. # [13:57] <hsivonen> s/socked/socket/
  432. # [13:59] <hsivonen> as for tools in different languages, if you can't make everything run on a JVM, communicating through a local socket is more efficient than having a persistence layer in between
  433. # [13:59] <hsivonen> I am assuming here that we don't want to keep copies of the spidered bytes
  434. # [14:00] <Philip`_> It would be useful to allow the thing to run on multiple computers to spread the load out, and then it would need some network communication for coordination instead of just threads
  435. # [14:01] <hsivonen> Philip`_: it might be worth investigating if instead of running a spider we should run on EC2 and read the latest Alexa spidering dump from S3
  436. # [14:01] <Philip`_> (I'm kind of thinking about multiple computers on a LAN with a fast internet connection, so the network wouldn't be a bottleneck when spreading stuff out)
  437. # [14:02] <hsivonen> I poked around the Amazon docs but I didn't find out if the Alexa dump can be easily read by URL instead of by handle obtained from Alexa search results
  438. # [14:02] <Philip`_> That sounds like a useful thing to investigate
  439. # [14:03] <hsivonen> Philip`_: anyway, you definitely want to keep the JVM up and running with multiple threads reading from sockets instead of invoking it again and again
  440. # [14:03] <hsivonen> I don't know where the other end of those sockets should be
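
Editorial sketch of the layout hsivonen proposes above: several worker threads, each doing both the fetch and the parse, pulling URLs from and reporting results to one thread-safe coordination object. It is in Python rather than on a JVM, and `Coordinator`, `worker`, `run` and the `fetch_and_parse` callback are hypothetical names; no real HTTP client or parser is involved.

```python
import threading
from queue import Queue, Empty


class Coordinator:
    """Thread-safe hand-out of URLs and collection of results."""

    def __init__(self, urls):
        self._pending = Queue()
        for url in urls:
            self._pending.put(url)
        self._lock = threading.Lock()
        self.results = {}

    def next_url(self):
        """Return the next URL to process, or None when none are left."""
        try:
            return self._pending.get_nowait()
        except Empty:
            return None

    def report(self, url, result):
        with self._lock:
            self.results[url] = result


def worker(coordinator, fetch_and_parse):
    """Fetch and parse URLs in the same thread until the queue is empty."""
    while True:
        url = coordinator.next_url()
        if url is None:
            return
        coordinator.report(url, fetch_and_parse(url))


def run(urls, fetch_and_parse, thread_count=4):
    coordinator = Coordinator(urls)
    threads = [
        threading.Thread(target=worker, args=(coordinator, fetch_and_parse))
        for _ in range(thread_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return coordinator.results
```

Spreading this over several machines, as Philip` suggests, would mean replacing the in-process queue with some network-visible coordinator; that part is not sketched here.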
  441. # [14:06] * Quits: othermaciej (n=mjs@dsl081-048-145.sfo1.dsl.speakeasy.net)
  442. # [14:08] <Philip`_> Perhaps the hardest bit is working out which pages to look at so that the sample is biased sensibly - I assume normal spiders just try to grab as much stuff as possible, which is not useful since they'll spend far too long in a few large sites
  443. # [14:09] <hsivonen> yeah, I think in principle we want to look at the Web breadth first, but not just front pages
  444. # [14:09] <Philip`_> and I would expect it's not possible to grab a large enough sample to do something like PageRank to find the interesting pages
  445. # [14:13] <Philip`_> (though maybe it wouldn't be too rubbish to just use the process which the original PageRank is modelling, where you follow random links and have a ~15% chance of getting bored and jumping to some other arbitrary page)
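
Editorial sketch of the sampling process Philip` describes above: follow a random outgoing link, and with roughly 15% probability get bored and jump to some other page. It assumes an in-memory `links` dictionary in place of real fetching and a list of `seeds` as the pool of jump targets; the function and parameter names are hypothetical.

```python
import random


def random_surfer_sample(links, seeds, steps, jump_probability=0.15, rng=None):
    """Return the list of pages visited by a random surfer.

    links maps a page to the list of pages it links to; seeds is the list
    of pages the surfer may jump to when bored or at a dead end.
    """
    rng = rng or random.Random()
    page = rng.choice(seeds)
    visited = []
    for _ in range(steps):
        visited.append(page)
        outgoing = links.get(page, [])
        if not outgoing or rng.random() < jump_probability:
            # Bored, or nowhere to go: jump to an arbitrary seed page.
            page = rng.choice(seeds)
        else:
            page = rng.choice(outgoing)
    return visited
```

Counting how often each page appears in the returned list gives a rough popularity-weighted sample without computing PageRank over the whole graph.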
  446. # [14:15] * Joins: webben (i=benh@nat/yahoo/x-bd7f5d0228cb47d3)
  447. # [14:15] <hsivonen> cool. the IA crawler uses Commons HttpClient
  448. # [14:21] * Quits: webben (i=benh@nat/yahoo/x-bd7f5d0228cb47d3) (Read error: 104 (Connection reset by peer))
  449. # [14:21] * Joins: webben (i=benh@nat/yahoo/x-726aa07150f97726)
  450. # [14:26] <hsivonen> Philip`_: I encourage you to take a look at http://crawler.archive.org/
  451. # [14:33] * Joins: SavageX (n=maikmert@T63c3.t.pppool.de)
  452. # [14:33] * Joins: Ducki__ (n=Alex@dialin-212-144-064-058.pools.arcor-ip.net)
  453. # [14:51] * Quits: maikmerten (n=maikmert@T6eaf.t.pppool.de) (Read error: 110 (Connection timed out))
  454. # [14:53] * Quits: Ducki_ (n=Alex@dialin-145-254-189-168.pools.arcor-ip.net) (Read error: 110 (Connection timed out))
  455. # [15:26] * Quits: annevk (n=annevk@pat-tdc.opera.com) (Read error: 104 (Connection reset by peer))
  456. # [15:26] * Joins: annevk (n=annevk@pat-tdc.opera.com)
  457. # [15:40] * Quits: annevk (n=annevk@pat-tdc.opera.com) (Read error: 104 (Connection reset by peer))
  458. # [15:41] * Joins: annevk (n=annevk@pat-tdc.opera.com)
  459. # [15:43] * Quits: hendry (n=hendry@91.84.62.62) (Read error: 113 (No route to host))
  460. # [15:43] * Joins: hendry (n=hendry@91.84.62.62)
  461. # [15:44] * Quits: jgraham (n=jgraham@81-86-222-233.dsl.pipex.com) (Read error: 110 (Connection timed out))
  462. # [15:51] * Joins: jgraham (n=jgraham@81-86-222-233.dsl.pipex.com)
  463. # [16:05] * Quits: webben (i=benh@nat/yahoo/x-726aa07150f97726)
  464. # [16:16] * Joins: Lachy (n=Lachy@124-168-24-114.dyn.iinet.net.au)
  465. # [16:28] * Quits: billyjack (n=MikeSmit@eM60-254-242-228.pool.emobile.ad.jp) (Read error: 110 (Connection timed out))
  466. # [16:29] * Joins: tndH_ (i=Rob@83.100.252.160)
  467. # [16:30] * Joins: billyjack (n=MikeSmit@eM60-254-240-50.pool.emobile.ad.jp)
  468. # [16:33] * Joins: Ducki_ (i=Alex@dialin-145-254-188-006.pools.arcor-ip.net)
  469. # [16:37] * billyjack is now known as MikeSmith
  470. # [16:46] * Quits: tndH (i=Rob@adsl-87-102-93-12.karoo.KCOM.COM) (Read error: 110 (Connection timed out))
  471. # [16:51] * Quits: hendry (n=hendry@91.84.62.62) ("brb")
  472. # [16:51] * Quits: Ducki__ (n=Alex@dialin-212-144-064-058.pools.arcor-ip.net) (Read error: 113 (No route to host))
  473. # [16:54] * Joins: hendry (n=hendry@91.84.62.62)
  474. # [17:27] * Quits: Lachy (n=Lachy@124-168-24-114.dyn.iinet.net.au) (Read error: 110 (Connection timed out))
  475. # [17:43] * Joins: tantek (n=tantek@adsl-63-195-114-133.dsl.snfc21.pacbell.net)
  476. # [17:53] * Quits: virtuelv (n=virtuelv@pat-tdc.opera.com) (Read error: 110 (Connection timed out))
  477. # [18:02] * Joins: virtuelv (n=virtuelv@pat-tdc.opera.com)
  478. # [18:04] * Joins: Lachy (n=Lachy@124-168-24-114.dyn.iinet.net.au)
  479. # [18:27] * Joins: hasather (n=hasather@22.80-203-71.nextgentel.com)
  480. # [18:34] * Joins: Ducki__ (n=Alex@dialin-145-254-189-020.pools.arcor-ip.net)
  481. # [18:42] * Quits: tantek (n=tantek@adsl-63-195-114-133.dsl.snfc21.pacbell.net)
  482. # [18:44] * Joins: duryodhan (n=chatzill@221-128-173-162.static.exatt.net)
  483. # [18:50] * Quits: gsnedders (n=gsnedder@host81-132-88-104.range81-132.btcentralplus.com) (Read error: 104 (Connection reset by peer))
  484. # [18:51] * Joins: gsnedders (n=gsnedder@host81-132-88-104.range81-132.btcentralplus.com)
  485. # [18:53] * Quits: Ducki_ (i=Alex@dialin-145-254-188-006.pools.arcor-ip.net) (Read error: 113 (No route to host))
  486. # [19:11] * Quits: BenWard (i=BenWard@nat/yahoo/x-36d10ff5536839e6) ("Fades out again…")
  487. # [19:23] * Philip`_ is now known as Philip`
  488. # [19:26] * Joins: webben (i=benh@nat/yahoo/x-9081a1806ada02c3)
  489. # [19:26] * Quits: hendry (n=hendry@91.84.62.62) ("vmware")
  490. # [19:32] * Joins: Codler (n=Codler@84-218-6-152.eurobelladsl.telenor.se)
  491. # [19:33] * Parts: hasather (n=hasather@22.80-203-71.nextgentel.com)
  492. # [19:35] * Joins: hasather (n=hasather@22.80-203-71.nextgentel.com)
  493. # [19:35] <annevk> http://html5.org/parsing-tests/testrunner.htm
  494. # [19:38] <annevk> lots of browser backing for ignoring </head>
  495. # [19:39] <annevk> but I guess that was already known
  496. # [19:40] <annevk> I suppose next would be some prefs so you can ignore IE <title> insertions
  497. # [19:50] * Joins: hendry (n=hendry@91.84.62.62)
  498. # [20:04] * Joins: tndH (i=Rob@83.100.252.160)
  499. # [20:15] * Quits: ROBOd (n=robod@86.34.246.154) ("http://www.robodesign.ro")
  500. # [20:18] * Quits: tndH_ (i=Rob@83.100.252.160) (Read error: 110 (Connection timed out))
  501. # [20:20] * Joins: bzed (n=bzed@dslb-084-059-118-233.pools.arcor-ip.net)
  502. # [20:29] <jgraham> annevk: re: running python on my web server; the short answer is that I can't (that was in response to your message a few days ago)
  503. # [20:34] * Joins: Ducki_ (n=Alex@dialin-145-254-187-047.pools.arcor-ip.net)
  504. # [20:42] * Quits: Ducki__ (n=Alex@dialin-145-254-189-020.pools.arcor-ip.net) (Read error: 104 (Connection reset by peer))
  505. # [20:46] * Quits: gsnedders (n=gsnedder@host81-132-88-104.range81-132.btcentralplus.com) ("Don't touch /dev/null…")
  506. # [20:48] * Quits: Codler (n=Codler@84-218-6-152.eurobelladsl.telenor.se) (Client Quit)
  507. # [20:51] <annevk> jgraham, are you a registered user?
  508. # [20:51] <annevk> Philip`, zcorpan, you can now filter with http://html5.org/parsing-tests/testrunner.htm as well for IE specific quirks
  509. # [20:54] * annevk wonders what tantek will do next
  510. # [21:01] * Quits: webben (i=benh@nat/yahoo/x-9081a1806ada02c3) (Read error: 110 (Connection timed out))
  511. # [21:02] <annevk> Setting the flag makes a lot more pass in IE and Opera. Mostly because IE messes up both DOCTYPE and inserts <title> and because Opera does not include DOCTYPE at all
  512. # [21:03] <annevk> It also helps some for Firefox which always uppercases the tag name in the DOCTYPE
  513. # [21:04] <jgraham> annevk: Of freenode? No
  514. # [21:11] * Quits: SavageX (n=maikmert@T63c3.t.pppool.de) ("Leaving")
  515. # [21:19] <zcorpan> annevk: nice!
  516. # [21:25] <annevk> I fixed some further bugs and I'm going home now
  517. # [21:26] <annevk> I'll commit it tomorrow to one of the open source thingies we have
  518. # [21:26] <zcorpan> ok
  519. # [21:26] <annevk> now someone can write python scripts to iterate over those numbers browsers return...
  520. # [21:36] <Hixie> of the 50 or so sites I found with cycles in the headers="", all but three are government sites
  521. # [21:38] * Joins: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
  522. # [21:47] * Joins: weinig (n=weinig@c-67-188-89-242.hsd1.ca.comcast.net)
  523. # [21:49] * Joins: gsnedders (n=gsnedder@host81-132-88-104.range81-132.btcentralplus.com)
  524. # [21:50] <mpt> How does that compare with the proportion of government sites without cycles in the headers?
  525. # [21:50] <mpt> (Not that I'm interested, it's just the basic "compared to what?" question)
  526. # [21:54] * Joins: zcorpan_ (n=zcorpan@84-216-43-119.sprayadsl.telenor.se)
  527. # [21:59] * Quits: weinig (n=weinig@c-67-188-89-242.hsd1.ca.comcast.net) (Remote closed the connection)
  528. # [22:01] * Joins: weinig (n=weinig@c-67-188-89-242.hsd1.ca.comcast.net)
  529. # [22:01] * Quits: weinig (n=weinig@c-67-188-89-242.hsd1.ca.comcast.net) (Remote closed the connection)
  530. # [22:02] <Hixie> mpt: the fact that it's 50 basically means it's an insignificant number that have cycles
  531. # [22:04] * Joins: weinig (n=weinig@c-67-188-89-242.hsd1.ca.comcast.net)
  532. # [22:06] <mpt> ok
  533. # [22:07] <Hixie> http://sixstar.cca.gov.tw/community/pages/01_about_people.php?CommID=1231&ID=1
  534. # [22:07] <Hixie> it's so hard to argue that that is a valid use of headers=""
  535. # [22:07] <Hixie> sigh
  536. # [22:08] <Hixie> with my proposed heuristic for the top left cell, if they changed that into an actual table it would actually work fine with implied scope=s
  537. # [22:11] <hsivonen> Hixie: btw, shouldn't scope be down, up, right, left (not row/column)
  538. # [22:12] <hsivonen> Hixie: if you have two rows of headers where the upper row applies to the lower row but not vice versa, shouldn't scope be down instead of column?
  539. # [22:14] <hsivonen> An end tag whose tag name is one of: "p", "br" is weird to have in "in head noscript"
  540. # [22:17] * Quits: zcorpan (n=zcorpan@84-216-43-119.sprayadsl.telenor.se) (Read error: 110 (Connection timed out))
  541. # [22:17] <zcorpan_> hsivonen: why?
  542. # [22:18] <Hixie> hsivonen: the values come from html4
  543. # [22:18] <hsivonen> zcorpan_: other stray end tags get ignored
  544. # [22:18] <hsivonen> Hixie: I know that the explicit ones come from there but implicit ones don't have to
  545. # [22:18] <zcorpan_> hsivonen: not </p> or </br>
  546. # [22:19] <hsivonen> zcorpan_: yeah. like I said, weird
  547. # [22:19] <Hixie> hsivonen: there's only one implicit one, "auto", and it has no keyword
  548. # [22:19] <zcorpan_> hsivonen: not specific to in noscript in head though
  549. # [22:22] <Hixie> wow, some (very few) of the pages caused the AAA algorithm to create over 1000 clones for one stray end tag
  550. # [22:24] <hsivonen> Hixie: I hope that doesn't count as a reason to redesign the algorithm
  551. # [22:24] <Hixie> no, it's expected really
  552. # [22:24] <hsivonen> Hixie: what Safari does on those pages? what about Firefox or Opera?
  553. # [22:24] <Hixie> no idea, dunno which pages it is
  554. # [22:25] <Hixie> 355 billion invocations of the AAA algorithm resulted in zero clones
  555. # [22:26] <Hixie> 715 thousand invocations resulted in one clone
  556. # [22:26] <Hixie> er sorry
  557. # [22:26] <Hixie> 715 million
  558. # [22:26] <Hixie> 55 million resulted in 2 clones
  559. # [22:26] <Hixie> 10 million, 3 clones
  560. # [22:26] <Hixie> 3 million, 4 clones
  561. # [22:27] <Hixie> 800 thousand, 5 clones
  562. # [22:27] <Hixie> 460000 6 clones
  563. # [22:27] <gsnedders> Hixie: 1 billion == 1 million million or 1 thousand million?
  564. # [22:27] <Hixie> 237000 7 clones
  565. # [22:27] <Hixie> US billion, thousand million, 1e9
  566. # [22:28] * Quits: MikeSmith (n=MikeSmit@eM60-254-240-50.pool.emobile.ad.jp) (Read error: 110 (Connection timed out))
  567. # [22:28] <Hixie> less than 100,000 instances of the AAA algorithm resulted in 11 clones
  568. # [22:28] <Hixie> i guess i should have gotten the total count
  569. # [22:28] <hsivonen> Hixie: cool. are you going to post this to public-html?
  570. # [22:28] <Hixie> to make this a useful number
  571. # [22:28] <Hixie> in due course
  572. # [22:29] * Philip` finds that writing the HTML5 tokeniser as an OCaml data structure and then printing C++ from it is perhaps slightly crazy, but doesn't seem entirely infeasible (though I've only got about a quarter of two states implemented so far...)
  573. # [22:30] <Hixie> wait this can't be right, according to separate data, there were only 900,000,000 invocations of the AAA
  574. # [22:30] <Hixie> oh, wrong number
  575. # [22:30] <Hixie> phew
  576. # [22:34] * Joins: Ducki__ (i=Alex@dialin-212-144-065-230.pools.arcor-ip.net)
  577. # [22:35] * Quits: tndH (i=Rob@83.100.252.160) (Read error: 110 (Connection timed out))
  578. # [22:35] * Joins: tndH (i=Rob@83.100.252.160)
  579. # [22:43] <hsivonen> Hixie: I forgot to ask you this when you asked about instrumentation but did you record data on stack depth?
  580. # [22:44] <Hixie> yeah but it's biased because my parser bails after 64k elements
  581. # [22:45] <hsivonen> Hixie: what did you find?
  582. # [22:45] <Hixie> http://freechal.com/banilaB8 was one of the worst pages
  583. # [22:45] <Hixie> (that my parser didn't bail on)
  584. # [22:45] <hsivonen> Hixie: so you use a hard limit as well ;-)
  585. # [22:46] <Hixie> well i run out of bits to store the pointer in after 64k
  586. # [22:46] <hsivonen> the pointer?
  587. # [22:46] <Hixie> i have 64 bits to store the length of the text node, the offset of the text node, the pointer to the parent element, and some bits for e.g. if it's a comment node or a text node
  588. # [22:47] <Hixie> and the bit that points to the parent element has to also sit alongside the 24 bits i use for the element flags
  589. # [22:47] <Hixie> anyway
  590. # [22:48] <Hixie> the 50th percentile of the pages my parser didn't bail on had 16 or fewer nodes in its stack at the biggest point
  591. # [22:48] <Hixie> 99th percentile had 40 or less
  592. # [22:48] <Hixie> 100th percentile had 64k
  593. # [22:48] <hsivonen> Hixie: thanks
  594. # [22:48] <Hixie> i can get you more later but i really have to go shower
  595. # [22:49] * hsivonen does new StackNode[64]
  596. # [22:49] <Hixie> heh
  597. # [22:55] * Quits: Ducki_ (n=Alex@dialin-145-254-187-047.pools.arcor-ip.net) (Read error: 113 (No route to host))
  598. # [23:01] <Hixie> incidentally, the reason i used 64k as my limit is that i'm having to balance the number of text nodes with the number of elements
  599. # [23:01] <Hixie> right now my text nodes are 32k max each
  600. # [23:01] <Hixie> i could make them 16k each but have 128k elements, but it turns out that, anecdotally, to process any significantly greater number of pages, i'd have to add many many bits
  601. # [23:01] <Hixie> like 4, or 5
  602. # [23:02] <Hixie> whereas there are many pages with more than 32k characters at once
  603. # [23:02] <Hixie> i suspect that the pathological cases with deep stacks are all cases of bad interactions with AAA
  604. # [23:02] * Quits: weinig (n=weinig@c-67-188-89-242.hsd1.ca.comcast.net)
  605. # [23:04] * Quits: Ducki__ (i=Alex@dialin-212-144-065-230.pools.arcor-ip.net) (Read error: 113 (No route to host))
  606. # [23:05] * Philip` wonders why Opera says "XML parsing failed" when loading http://html5.org/parsing-tests/data/tests3.dat
  607. # [23:06] <Philip`> Oh, how odd, it works when I reload...
  608. # [23:09] <zcorpan_> Philip`: because it thinks anything loaded through XHR is XML
  609. # [23:09] <zcorpan_> Philip`: and then remembers that
  610. # [23:09] <Hixie> bbl
  611. # [23:11] <Philip`> zcorpan_: Ah, that seems to make as much sense as could be expected
  612. # [23:14] * Joins: othermaciej (n=mjs@dsl081-048-145.sfo1.dsl.speakeasy.net)
  613. # [23:16] <hsivonen> do these statements have a significant difference "If the stack of open elements has an element in scope with the same tag name as that of the token, then pop elements from this stack until an element with that tag name has been popped from the stack." and "If the stack of open elements has an element in scope with the same tag name as that of the token, then pop elements from this stack until the stack no longer has an element with the same tag nam
  614. # [23:17] <Hixie> yes
  615. # [23:17] <hsivonen> ok
  616. # [23:17] <Hixie> it differs if the stack has two elements of that name in it
  617. # [23:17] <Hixie> e.g.
  618. # [23:17] <Hixie> <div><div>
  619. # [23:17] <Hixie> however typically the second wording is only used for elements that can't be twice on the stack
  620. # [23:17] <Hixie> in which case it doesn't matter
  621. # [23:18] <hsivonen> Hixie: how do you get two nested <p> elements in scope?
  622. # [23:18] <Hixie> i don't think you can
  623. # [23:19] * Parts: hasather (n=hasather@22.80-203-71.nextgentel.com)
  624. # [23:19] <hsivonen> Hixie: ok. thanks. I'll send email. Every time you use a different wording for no good reason, I have to stop and think. :-)
  625. # [23:20] <Hixie> thinking is good! :-)
  626. # [23:21] <Hixie> bbl
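
Editorial sketch of the difference between the two wordings hsivonen asks about above. It models the stack of open elements as a plain list of tag names and leaves out the "in scope" check entirely, which is a simplifying assumption; the function names are hypothetical.

```python
def pop_until_popped(stack, name):
    """First wording: pop until one element with that name has been popped."""
    while stack:
        if stack.pop() == name:
            break


def pop_until_absent(stack, name):
    """Second wording: pop until no element with that name is left."""
    while name in stack:
        stack.pop()


# Hixie's <div><div> case: the two wordings give different results.
first = ["html", "body", "div", "div"]
second = list(first)
pop_until_popped(first, "div")    # leaves the outer div open
pop_until_absent(second, "div")   # closes both divs

# A name that appears only once on the stack: both wordings agree.
once_a = ["html", "body", "p", "b"]
once_b = list(once_a)
pop_until_popped(once_a, "p")
pop_until_absent(once_b, "p")
```

This matches Hixie's remark that the second wording only matters for elements that can appear twice on the stack.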
  627. # [23:29] * aroben is now known as aroben|food
  628. # [23:30] * Quits: aroben|food (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
  629. # [23:50] * Quits: Charl (n=charlvn@c1-228-9.wblv.isadsl.co.za) ("Leaving")
  630. # [23:50] * Joins: weinig (i=weinig@nat/apple/x-88c022b759e253c0)
  631. # [23:53] * Joins: aroben|food (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
  632. # [23:54] * aroben|food is now known as aroben
  633. # [23:55] * Quits: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net) (Client Quit)
  634. # [23:56] * Joins: csarven (n=nevrasc@modemcable081.152-201-24.mc.videotron.ca)
  635. # [23:59] * Joins: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
  636. # Session Close: Thu Jul 05 00:00:00 2007
