- # Session Start: Wed Jul 04 00:00:00 2007
- # Session Ident: #whatwg
- # [00:00] * Quits: KevinMarks (i=KevinMar@nat/google/x-2987d34f5000d2a1) ("The computer fell asleep")
- # [00:10] * Joins: KevinMarks (i=KevinMar@nat/google/x-ea66512c0f090208)
- # [00:12] <zcorpan_> annevk: are there tests on things like </p>, <html></p>, <head></p>, etc, in the html5lib tests?
- # [00:13] <zcorpan_> public-html starts to get pretty high traffic again
- # [00:24] <Hixie> typical longdesc: http://130.83.47.128/masterfiles/descriptions/logo.txt
- # [00:24] <webben> typical of what?
- # [00:25] <Hixie> typical of the longdescs that are actually not completely bogus
- # [00:25] <Hixie> (that's from http://130.83.47.128/vv/ss/comments/13.205.en.tud)
- # [00:25] <Hixie> (the first one on my list of "interesting" uses)
- # [00:26] <webben> not a terrible longdesc I suppose
- # [00:26] <webben> distinguishing between alternate text and explaining what the image is
- # [00:26] <Hixie> <a href="http://www.google.co.jp/">
- # [00:26] <Hixie> <img src="http://blog2.fc2.com/2/20century/file/Logo_20s.gif" alt="Google" height="75" width="143" longdesc="http://www.google.co.jp/logos.html" /></a>
- # [00:26] <webben> shame they didn't explain what the logo actually depicts
- # [00:27] * Hixie bangs head against table
- # [00:27] <jgraham> zcorpan_: I can't see any tests for those cases (though I thought anne had checked some in...). If you want to add some I can add you to the html5lib members list
- # [00:28] <webben> Hixie: maybe the text is helpful for that one
- # [00:28] * webben can't read Japanese
- # [00:28] <webben> oh wait, Google can read Japanese
- # [00:28] <Philip`> But that logo.txt longdesc is in the wrong language for that page (which I guess could be because the site's developers had no way to actually test longdesc so it fell out of sync with the page contents)...
- # [00:28] <Hixie> from that en.tud page, lower down:
- # [00:28] <Hixie> <img src="/masterfiles/images/blue10x1.gif" alt="[Abstandhalter]" title="[Abstandhalter]" longdesc="/masterfiles/descriptions/abstandhalter.txt">
- # [00:28] <Hixie> guess what the "/masterfiles/descriptions/abstandhalter.txt" file contains
- # [00:28] <webben> Philip`: good point
- # [00:31] <Hixie> i think i've yet to see an actual useful, valid use of longdesc="" in this study
- # [00:32] <Hixie> bbl
- # [00:32] <webben> Hixie: you should include uses of D-links
- # [00:32] <webben> since for a long time D-link was used as a longdesc alternative based on poor support for longdesc
- # [00:33] * Joins: weinig (i=weinig@nat/apple/x-a260b6922c3b12a6)
- # [00:34] * Quits: weinig_ (i=weinig@nat/apple/x-a4970a9ef18c9aca) (Read error: 104 (Connection reset by peer))
- # [00:34] <webben> see also: http://www.w3.org/TR/WCAG10-HTML-TECHS/#long-descriptions
- # [00:34] <webben> it would be interesting to know how many links in the wild have a value of D or [D] or similar
- # [00:34] <webben> s/value/text content/
- # [00:36] * Philip` wants to rewrite his own rubbish survey tool to be slightly less rubbish, so he can get vaguely interesting numbers about common features
- # [00:37] <webben> how many links ... and what they point to, of course
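The D-link count webben asks about could be approximated per page along these lines. This is a minimal sketch, not anything used in the channel: it relies on the standard-library `html.parser` rather than html5lib, and the set of texts treated as D-links (`D_LINK_TEXTS`) and the names `DLinkCounter` and `find_d_links` are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Link texts treated as "D-links"; this set is an assumption for illustration.
D_LINK_TEXTS = {"d", "[d]"}


class DLinkCounter(HTMLParser):
    """Collects the href of every <a> whose text content is D, [D], or similar."""

    def __init__(self):
        super().__init__()
        self.d_link_targets = []
        self._href = None       # href of the <a> currently open, if any
        self._in_link = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            text = "".join(self._text).strip().lower()
            if text in D_LINK_TEXTS:
                self.d_link_targets.append(self._href)
            self._in_link = False


def find_d_links(html):
    """Return the targets of all D-links found in one HTML document."""
    parser = DLinkCounter()
    parser.feed(html)
    parser.close()
    return parser.d_link_targets
```

Returning the targets rather than a bare count covers the follow-up question of what the links point to.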
- # [00:37] * jgraham wants a google-scale cluster to run a survey on
- # [00:38] <jgraham> and a pony, of course
- # [00:39] <jgraham> But seriously, Philip`, it would be nice if your survey tool was more widely available. It would be even better if the parser was fast. I wonder if any of the HTML5-parser-in-C projects are going to produce something soon?
- # [00:40] <Philip`> At least my initial version taught me that SQLite is completely rubbish when you have concurrency - it kept throwing exceptions because the whole database was locked
- # [00:40] <Philip`> so I need to rewrite it with MySQL or something
- # [00:42] * Quits: the_mart (n=Martin@host86-135-9-158.range86-135.btcentralplus.com) ("Leaving")
- # [00:42] <Philip`> and I think it should do some simple crawling, rather than only looking at a fixed list of URLs, so it can find more stuff to look at
- # [00:43] <Philip`> (and a faster parser would definitely be useful :-) )
- # [00:44] * Joins: csarven (n=nevrasc@modemcable081.152-201-24.mc.videotron.ca)
- # [00:45] <Philip`> (A Java one would probably be as good as a C one)
- # [00:47] <bewest> sounds like a bunch of people are interested in some kind of survey tool available to the community
- # [00:48] <webben> Here's a good example of longdesc-as-long-alternative: http://www.fhwa.dot.gov/hfl/framework/04.cfm referring to http://www.fhwa.dot.gov/hfl/framework/longdesc.cfm#fig1
- # [00:48] <bewest> purpose would be 2-fold, correct? 1.) survey usage of authoring techniques on the web. 2.) test parsers?
- # [00:49] <Philip`> 3.) Confirm whether Hixie's stats are reasonable, or if he's just making up all the numbers :-)
- # [00:50] <bewest> I've thought about doing this with ec2 and Alexa's web services
- # [00:50] <bewest> eg greptheweb, and MSR
- # [00:50] <bewest> alexa has crawled documents in s3
- # [00:51] <bewest> but that costs money
- # [00:52] <zcorpan_> jgraham: sure. i might check in this browser port too
- # [00:53] <zcorpan_> othermaciej: rewrote the function to not be recursive but still get the same error in safari
- # [00:53] <bewest> Philip`: so you already have some kind of survey tool? how does it work?
- # [00:54] <Philip`> bewest: Ah, I wasn't aware of those things, though I tend to never consider anything that requires money :-)
- # [00:55] <bewest> yeah...
- # [00:55] <bewest> usually I don't either
- # [00:55] <bewest> except that I work at the company that makes those services
- # [00:55] <Philip`> It was just something simple for things like http://canvex.lazyilluminati.com/misc/copyright.html and http://canvex.lazyilluminati.com/misc/summary.html
- # [00:56] <Philip`> (and a few other things which I can't remember where I put)
- # [00:56] <Philip`> where I give it a list of a few thousand URLs (from Yahoo search results for arbitrary terms), and it just downloads them then parses them (with html5lib) and looks for certain stuff
- # [00:57] <Philip`> (and sort of does those things in parallel, if you run lots of copies of the program, except most of the processes keep dying because SQLite gets unhappy)
- # [00:58] <Philip`> (and then some pages cause quadratic behaviour in html5lib and you have to manually delete them from the database)
- # [00:58] <Philip`> (so it's all just horribly hacked together :-p )
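For the "database is locked" failures described above, one common mitigation (not necessarily what Philip` ended up doing) is to give each SQLite connection a busy timeout so that a writer waits for the lock instead of raising immediately. The sketch below assumes a hypothetical `results` table and hypothetical helper names; only `sqlite3.connect(..., timeout=...)` is real library API.

```python
import sqlite3


def open_results_db(path, busy_timeout_s=30.0):
    """Open the survey database; wait up to busy_timeout_s for a held lock."""
    conn = sqlite3.connect(path, timeout=busy_timeout_s)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "url TEXT NOT NULL, feature TEXT NOT NULL, hits INTEGER NOT NULL, "
        "PRIMARY KEY (url, feature))"
    )
    conn.commit()
    return conn


def record_result(conn, url, feature, hits):
    """Store how often `feature` was seen on `url`, replacing any older row."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO results (url, feature, hits) VALUES (?, ?, ?)",
            (url, feature, hits),
        )


def feature_totals(conn):
    """Map each feature to (pages using it, total occurrences)."""
    rows = conn.execute(
        "SELECT feature, COUNT(*), SUM(hits) FROM results "
        "WHERE hits > 0 GROUP BY feature ORDER BY feature"
    )
    return {feature: (pages, total) for feature, pages, total in rows}
```

A timeout only makes writers queue up; with many worker processes, funnelling all writes through a single process is another option.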
- # [00:59] <bewest> heh
- # [01:00] <othermaciej> zcorpan_: that's odd
- # [01:00] <othermaciej> zcorpan_: pointer?
- # [01:01] <zcorpan_> othermaciej: http://simon.html5.org/temp/html5lib-tests/wrapper.html
- # [01:01] <Hixie> webben: studying text contents is much harder for various reasons
- # [01:02] <webben> of course it's harder
- # [01:02] <webben> but given we're talking about what's basically a language for marking up text, such study is pretty critical
- # [01:03] <Hixie> be my guest :-)
- # [01:05] <othermaciej> zcorpan_: very confusing
- # [01:05] <othermaciej> zcorpan_: I'll try debugging it in a while - need to get coffee first
- # [01:05] <zcorpan_> othermaciej: ok
- # [01:06] <zcorpan_> man, i've really spent all day on this thing
- # [01:07] <Hixie> how does it feel to be paid to do this nonsense? :-)
- # [01:07] <jgraham> zcorpan_: You should now be able to commit to html5lib svn. If you're committing tests that html5lib doesn't pass, it's really good to email html5lib-discuss@googlegroups.com so people know there hasn't been a regression
- # [01:08] <zcorpan_> Hixie: feels great :)
- # [01:08] <zcorpan_> jgraham: ok. thanks
- # [01:09] <Hixie> hey i guess working for opera also means you get w3c member access
- # [01:09] <zcorpan_> yeah
- # [01:09] <Hixie> now you can see the craziness you've previously only been able to imagine
- # [01:10] <jgraham> zcorpan_: I think you need to join the html5lib-discuss group to post to it btw.
- # [01:10] <Philip`> Are you being paid to work on this at 1am? :-)
- # [01:10] <zcorpan_> Philip`: yep :)
- # [01:10] <zcorpan_> Philip`: plus, i work from home
- # [01:10] <zcorpan_> my work day starts when i want and ends when i want
- # [01:11] <Dashiva> h4x
- # [01:11] <zcorpan_> which is usually when i wake up and when i go to bed, respectively
- # [01:11] * othermaciej is now known as om_coffee
- # [01:11] <Dashiva> We have core time in Oslo
- # [01:13] <zcorpan_> Hixie: i read the pointers in http://ln.hixie.ch/?start=1172653243&count=1 but i haven't looked at other craziness
- # [01:13] <Hixie> btw i'm going to be in oslo (though extremely tired) late next monday and early next tuesday
- # [01:13] <Hixie> i'll probably pop by the opera offices
- # [01:14] * zcorpan_ wonders if anyone will pop by the eskilstuna office
- # [01:15] <Dashiva> Just as I take two days off. I'm going to miss the munchkin playing, no doubt.
- # [01:19] <zcorpan_> anything interesting on public-html the past 24h?
- # [01:20] * Quits: billmason (n=billmaso@ip156.unival.com) (Read error: 104 (Connection reset by peer))
- # [01:20] * Quits: tndH (i=Rob@adsl-87-102-93-12.karoo.KCOM.COM) ("ChatZilla 0.9.78.1-rdmsoft [XULRunner 1.8.0.9/2006120508]")
- # [01:22] <Hixie> i just found this interesting tidbit:
- # [01:22] <Hixie> Tantek Çelik (Microsoft): We are in the XHTML WG. I am the representative; recently it has become clear that the priorities of the XHTML WG are different from our priorities. We would like to see the HTML 4 and XHTML 1.x versions resolved. Most of the folks in the WG are XHTML 2 and that is not a priority for us.
- # [01:22] <Hixie> from http://www.w3.org/2004/04/webapps-cdf-ws/minutes-20040601.html
- # [01:22] <Hixie> Steven Pemberton (W3C/CWI): If you want that done, you have to do it.
- # [01:23] * Quits: kingryan (n=kingryan@corp.technorati.com) (Remote closed the connection)
- # [01:23] * Joins: h3h (n=w3rd@66-162-32-234.static.twtelecom.net)
- # [01:25] <tantek> Thanks for the memory Hixie :)
- # [01:25] <tantek> yes, that workshop is where everything "blew up" as the kids say
- # [01:25] <Hixie> indeed
- # [01:26] <Hixie> but i didn't realise that steven actually told us to go do html5
- # [01:26] <tantek> he didn't
- # [01:26] <tantek> he told you to go do html5, and me to go do microformats
- # [01:26] <tantek> he just didn't realize he did ;)
- # [01:26] <tantek> and yes, you're welcome for the setup :)
- # [01:27] <Hixie> :-)
- # [01:28] <tantek> out of that workshop i was more convinced than ever that I had to leave microsoft and pursue microformats wherever there was support for them, knowing that you would have a pretty good handle on the HTML 4.x XHTML 1.x updates.
- # [01:32] <tantek> Hixie, it wouldn't be inaccurate for you to even state that Microsoft's representative to that workshop called for work on HTML4 and XHTML1 along a set of requirements remarkably similar to those adopted by WHATWG.
- # [01:32] <Hixie> indeed
- # [01:32] <tantek> thereby confirming all the conspiracy theorists suspicions that WHATWG is merely doing Microsoft's bidding. ;)
- # [01:33] * Quits: weinig (i=weinig@nat/apple/x-a260b6922c3b12a6) (Read error: 104 (Connection reset by peer))
- # [01:33] <Hixie> oh the modern conspiracy theory is that it's google's attempt at getting around the problem that converting adsense to xhtml2 would be too hard
- # [01:33] <zcorpan_> LOL
- # [01:36] * Joins: weinig (i=weinig@nat/apple/x-3021b5e01346d7af)
- # [01:41] * Quits: hendry (n=hendry@91.84.62.62) ("nn")
- # [01:50] * om_coffee is now known as othermaciej
- # [01:52] * Quits: h3h (n=w3rd@66-162-32-234.static.twtelecom.net)
- # [02:05] * Joins: epeus (i=KevinMar@conference/plone/docsprint/x-ea4c9cc997546964)
- # [02:08] * Joins: h3h (n=w3rd@66-162-32-234.static.twtelecom.net)
- # [02:08] * Quits: KevinMarks (i=KevinMar@nat/google/x-ea66512c0f090208) (Nick collision from services.)
- # [02:08] * epeus is now known as KevinMarks
- # [02:10] * Joins: kingryan (n=kingryan@dsl081-240-149.sfo1.dsl.speakeasy.net)
- # [02:24] * Joins: weinig_ (i=weinig@nat/apple/x-1d2c33c52f79e762)
- # [02:24] * Joins: epeus (i=KevinMar@nat/google/x-55d456545ad17e99)
- # [02:25] * Quits: syp| (n=syp@lasigpc9.epfl.ch) (kubrick.freenode.net irc.freenode.net)
- # [02:25] * Quits: fuzzy76 (i=fuzzy76@matilda.td.org.uit.no) (kubrick.freenode.net irc.freenode.net)
- # [02:25] * Joins: syp| (n=syp@lasigpc9.epfl.ch)
- # [02:25] * Joins: fuzzy76 (i=fuzzy76@matilda.td.org.uit.no)
- # [02:25] * Quits: weinig (i=weinig@nat/apple/x-3021b5e01346d7af) (Read error: 104 (Connection reset by peer))
- # [02:27] * Quits: KevinMarks (i=KevinMar@conference/plone/docsprint/x-ea4c9cc997546964) (Nick collision from services.)
- # [02:27] * epeus is now known as KevinMarks
- # [02:30] * Quits: KevinMarks (i=KevinMar@nat/google/x-55d456545ad17e99) ("The computer fell asleep")
- # [02:31] <webben> Hixie: more vaguely sane long descriptions: http://www.tsu.ox.ac.uk/info/report.php
- # [02:32] <webben> (although I think they could have made use of data tables)
- # [02:33] <webben> another example: http://docs.sun.com/source/817-5763/
- # [02:34] <webben> in general, look through this search: http://www.google.co.uk/search?hl=en&q=%22long+description+for%22 for lots of longdesc examples
- # [02:36] <Hixie> my script uses the same source data as that search, basically
- # [02:39] * Quits: zcorpan_ (n=zcorpan@84-216-41-27.sprayadsl.telenor.se) (Read error: 110 (Connection timed out))
- # [02:46] * Philip` never knew that IE supports <comment>...</comment>
- # [02:47] <Philip`> (Interestingly the text appears to be not in the DOM, but is in the innerHTML view)
- # [02:55] * Quits: webben (n=benh@91.84.193.157)
- # [02:56] * Quits: jgraham (n=jgraham@81-86-214-45.dsl.pipex.com) (Read error: 110 (Connection timed out))
- # [03:02] * Joins: karlUshi (n=karl@dhcp-247-173.mag.keio.ac.jp)
- # [03:13] * Quits: aroben (n=adamrobe@17.203.15.248)
- # [03:16] * Quits: weinig_ (i=weinig@nat/apple/x-1d2c33c52f79e762)
- # [03:28] * Quits: othermaciej (n=mjs@dsl081-048-145.sfo1.dsl.speakeasy.net)
- # [03:39] * Joins: othermaciej (n=mjs@dsl081-048-145.sfo1.dsl.speakeasy.net)
- # [03:45] * Joins: yod (n=ot@dhcp-247-181.mag.keio.ac.jp)
- # [03:52] * Joins: KevinMarks (n=KevinMar@c-76-102-254-252.hsd1.ca.comcast.net)
- # [03:58] * Joins: weinig (i=weinig@nat/apple/x-4db5afe5bef23360)
- # [04:07] * Joins: kfish (n=conrad@61.194.21.25)
- # [04:11] * Quits: tantek (n=tantek@corp.technorati.com)
- # [04:14] * Quits: h3h (n=w3rd@66-162-32-234.static.twtelecom.net) ("|")
- # [04:17] <Hixie> heh, i just noticed something about the press release the w3c put out when the charters were announced
- # [04:18] <othermaciej> yeah?
- # [04:18] <Hixie> it says:
- # [04:18] <Hixie> "With the chartering of the XHTML 2 Working Group, W3C will continue its technical work on the language at the same time it considers rebranding the technology to clarify its independence and value in the marketplace."
- # [04:19] <othermaciej> hah!
- # [04:20] * Quits: bzed (n=bzed@dslb-084-059-121-172.pools.arcor-ip.net) ("Leaving")
- # [04:20] <othermaciej> "dear xhtml2 wg, how is that rebranding coming along? love, the html wg"
- # [04:22] * Joins: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
- # [04:28] * Quits: MikeSmith (n=MikeSmit@eM60-254-215-75.pool.emobile.ad.jp) (Read error: 110 (Connection timed out))
- # [04:29] * Joins: MikeSmith (n=MikeSmit@eM60-254-213-154.pool.emobile.ad.jp)
- # [04:32] * Joins: Philip`_ (n=philip@zaynar.demon.co.uk)
- # [04:49] * Quits: Philip` (n=philip@zaynar.demon.co.uk) (Read error: 110 (Connection timed out))
- # [05:07] * Quits: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
- # [05:39] * Quits: Yudai (n=Yudai@p931010.tokyte00.ap.so-net.ne.jp) (Read error: 110 (Connection timed out))
- # [05:39] * Joins: Yudai (n=Yudai@pae3703.tokyte00.ap.so-net.ne.jp)
- # [05:44] * Quits: Lachy (n=Lachy@124-168-24-114.dyn.iinet.net.au) ("ChatZilla 0.9.78.1 [Firefox 2.0.0.4/2007051502]")
- # [05:53] * Joins: Lachy (n=Lachy@124-168-24-114.dyn.iinet.net.au)
- # [06:07] * Quits: kingryan (n=kingryan@dsl081-240-149.sfo1.dsl.speakeasy.net)
- # [06:17] * Joins: tantek (n=tantek@adsl-63-195-114-133.dsl.snfc21.pacbell.net)
- # [06:24] * Quits: weinig (i=weinig@nat/apple/x-4db5afe5bef23360)
- # [06:35] * Quits: tantek (n=tantek@adsl-63-195-114-133.dsl.snfc21.pacbell.net)
- # [06:45] <hsivonen> annevk: I meant that when you've got a form control whose form pointer does not point to an ancestor and that doesn't have a form='' attribute pointing to the same node as the form pointer, generate an id attribute on the node pointed by the form pointer if there isn't an id already and generate a corresponding form='' attribute on the form control
- # [06:45] <hsivonen> annevk: this fails if the <form> element already has an id='' attribute and the value of that attribute is a duplicate
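The association step hsivonen describes might look roughly like this. It is a sketch under stated assumptions: the `Element` class, the `form_owner` field standing in for the parser's form pointer, and the `form-N` id scheme are all hypothetical, and the duplicate-id failure he mentions is only noted, not solved.

```python
import itertools


class Element:
    """Hypothetical minimal element: a name, attributes, and a parent link."""

    def __init__(self, name, attrs=None, parent=None):
        self.name = name
        self.attrs = dict(attrs or {})
        self.parent = parent
        self.form_owner = None  # stands in for the parser's form pointer


def is_ancestor(candidate, node):
    """True if `candidate` is reachable from `node` by following parent links."""
    current = node.parent
    while current is not None:
        if current is candidate:
            return True
        current = current.parent
    return False


def reflect_form_owner(control, ids_in_use, id_counter):
    """Express a non-ancestor form owner as a form='' attribute on the control."""
    form = control.form_owner
    if form is None or is_ancestor(form, control):
        return  # ancestry already expresses the association
    if "id" not in form.attrs:
        # Generate an id that is not yet taken in this document.
        while True:
            candidate = "form-%d" % next(id_counter)
            if candidate not in ids_in_use:
                break
        form.attrs["id"] = candidate
        ids_in_use.add(candidate)
    # Caveat from the discussion: if the form's existing id is a duplicate,
    # the form='' attribute may resolve to a different element. Not handled here.
    control.attrs["form"] = form.attrs["id"]
```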
- # [06:51] * Joins: weinig (n=weinig@c-67-188-89-242.hsd1.ca.comcast.net)
- # [06:51] * Quits: jcgregorio (n=chatzill@adsl-072-148-043-048.sip.rmo.bellsouth.net) ("ChatZilla 0.9.78.1 [Firefox 2.0.0.4/2007060115]")
- # [06:57] <hsivonen> othermaciej: Also I suggested the iterative DOM traversal algorithm to zcorpan, but does IE guarantee that the algorithm terminates? I think it doesn't.
- # [06:58] * Joins: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
- # [06:59] * Quits: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net) (Remote closed the connection)
- # [06:59] * Joins: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
- # [07:01] <othermaciej> hsivonen: oh - good point, I'm not sure how it works in the face of a non-tree
- # [07:01] <othermaciej> hsivonen: I'm not sure what exactly IE's non-tree DOMs look like
- # [07:03] * Quits: weinig (n=weinig@c-67-188-89-242.hsd1.ca.comcast.net)
- # [07:03] <hsivonen> othermaciej: this is one significant reason why a non-tree DOM sucks
- # [07:04] * Joins: weinig (n=weinig@c-67-188-89-242.hsd1.ca.comcast.net)
- # [07:06] <othermaciej> hsivonen: I have seen a look of shocked realization on the faces of JS library authors when they heard that IE can do that
- # [07:07] <othermaciej> "that explains those weird infinite loop bugs!"
- # [07:07] <othermaciej> do you actually know what it does though?
- # [07:07] <othermaciej> is it just the parent pointer that can be wrong? you could work around that with a stack
- # [07:10] <Hixie> see my blog
- # [07:10] <Hixie> entries starting with "Tag Soup" iirc
- # [07:10] <Hixie> bbl
- # [07:11] * Quits: csarven (n=nevrasc@modemcable081.152-201-24.mc.videotron.ca) ("http:/www.csarven.ca")
- # [07:14] * Quits: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
- # [07:16] * Joins: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
- # [07:17] <hsivonen> othermaciej: not sure. The edges between EM and ADDRESS in the Mac IE 5 DOM with Hixie's case look like the ingredients of an infinite loop: http://hsivonen.iki.fi/soup-dom/ (I can't test IE6 here.)
- # [07:22] <othermaciej> good lord, that's insane
- # [07:22] * othermaciej blames tantek
- # [07:23] <othermaciej> child pointer indicates presence in the childNodes array?
- # [07:24] <hsivonen> Philip`_: If you'd like to run surveys with something that runs as native instructions at run time, I suggest figuring out which Java spider framework can easily take a plugged HTML5 parser
- # [07:25] <othermaciej> hsivonen: it looks like traversal via firstChild/nextSibling/parentNode would not infinite loop on that, but it would miss some elements
- # [07:25] * Quits: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
- # [07:25] <othermaciej> wait, maybe it wouldn't even miss anything
- # [07:25] <hsivonen> Philip`_: the parser needs to get a java.io.InputStream, the value of the HTTP charset (null if absent), a SAX ErrorHandler and a SAX ContentHandler (for extracting links)
- # [07:25] <hsivonen> othermaciej: child is firstchild
- # [07:26] <hsivonen> othermaciej: IIRC
- # [07:26] <othermaciej> it can't be only firstChild, since you can't have multiple firstChilds
- # [07:26] <hsivonen> othermaciej: oh. right. can't remember anymore what I did
- # [07:28] <othermaciej> some nodes would be visited more than once I guess, w/ tree-based traversal
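The iterative traversal under discussion walks the document using only first-child, next-sibling, and parent pointers. The sketch below models that walk on a hypothetical `Node` class (not a browser DOM) and adds guards so that it still terminates when the pointers do not form a tree; what IE's non-tree DOMs actually look like is left open in the conversation, so the guards are a general precaution rather than a description of IE.

```python
class Node:
    """Hypothetical node carrying just the three pointers the walk needs."""

    def __init__(self, name):
        self.name = name
        self.first_child = None
        self.next_sibling = None
        self.parent = None


def append_child(parent, child):
    """Attach child as the last child of parent, keeping the pointers consistent."""
    child.parent = parent
    if parent.first_child is None:
        parent.first_child = child
        return
    last = parent.first_child
    while last.next_sibling is not None:
        last = last.next_sibling
    last.next_sibling = child


def traverse(root):
    """Document-order walk without recursion or an explicit stack.

    Stops early, instead of looping forever, if a node is reached twice.
    """
    visited = set()
    order = []
    node = root
    while True:
        if id(node) in visited:
            return order  # inconsistent pointers: this is not a tree
        visited.add(id(node))
        order.append(node.name)
        if node.first_child is not None:
            node = node.first_child
            continue
        # Climb until some node on the way up has a next sibling.
        climbed = set()
        while node is not root and node.next_sibling is None:
            if node.parent is None or id(node) in climbed:
                return order  # broken or cyclic parent chain
            climbed.add(id(node))
            node = node.parent
        if node is root:
            return order
        node = node.next_sibling
```

On a well-formed tree the guards never fire and every node is visited exactly once; on a malformed structure the walk returns whatever it reached before the first repeat.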
- # [07:29] <othermaciej> we have some ex-MacIE folks on our team, I could ask them what they were thinking :-)
- # [07:29] <hsivonen> Philip`_: the Internet Archive spider looks promising, but they seem to rely on the JVM running on Linux with a particular thread impl
- # [07:30] <hsivonen> Philip`_: btw, I wouldn't run a Java spider that used java.net.URLConnection without socket timeouts
- # [07:30] <hsivonen> I have more confidence in Commons HTTP Client
- # [07:31] <hsivonen> I haven't checked which HTTP client the Internet Archive spider uses
- # [08:10] <Hixie> hm, xmlns="...xhtml" usage has gone up to 20% according to the survey i just did (of several billion html docs)
- # [08:11] <Hixie> from about 15% about a year ago
- # [08:15] <Hixie> and 41% have no DOCTYPE, down from about 50% at the same time iirc
- # [08:16] <Hixie> 19% have the XHTML1 DOCTYPE, 11% have a 4.01 Transitional DOCTYPE with no URI
- # [08:17] <Hixie> 6% are 4.01 Transitional with URI
- # [08:19] * Joins: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
- # [08:24] * Quits: kfish (n=conrad@61.194.21.25) (Remote closed the connection)
- # [08:25] * Joins: kfish (n=conrad@61.194.21.25)
- # [08:36] <Hixie> and the 0.014% of XHTML usage has gone up to 0.062%
- # [08:37] <hsivonen> Hixie: real XHTML? as in a/x+x
- # [08:38] <hsivonen> Amazon EC2 was mentioned earlier. any actual experience with using it?
- # [08:47] * othermaciej is surprised to hear there's that many sites that give the finger to IE; or is that conditionally served?
- # [08:50] <Hixie> hsivonen: yeah
- # [08:50] <Hixie> othermaciej: might be conditional, dunno
- # [08:51] <hsivonen> Hixie: does Google unify multiple representations of a page if it finds foo with Content-Location, foo.html and foo.xhtml?
- # [08:54] <Hixie> duplicate elimination happens before my script gets hold of the data, yes, but i don't know exactly what gets counted as a dupe
- # [08:55] * Joins: peepo (n=Jay@86.157.113.34)
- # [08:56] <hsivonen> hmm. looks like Google has changed its behavior again and now http://hsivonen.iki.fi/thesis/html5-conformance-checker over .html or .xhtml. IIRC, it returned http://hsivonen.iki.fi/thesis/html5-conformance-checker.xhtml a couple of weeks ago
- # [08:58] <hsivonen> s/now/now prefers/
- # [08:59] <Hixie> it probably treats them separately and picks one based on which has the most "relevance"
- # [09:05] * Joins: Charl (n=charlvn@c1-228-9.wblv.isadsl.co.za)
- # [09:10] * Joins: tndH (i=Rob@adsl-87-102-93-12.karoo.KCOM.COM)
- # [09:32] * Joins: BenWard (i=BenWard@nat/yahoo/x-36d10ff5536839e6)
- # [09:32] * Quits: karlUshi (n=karl@dhcp-247-173.mag.keio.ac.jp) ("Where dwelt Ymir, or wherein did he find sustenance?")
- # [09:32] * Quits: yod (n=ot@dhcp-247-181.mag.keio.ac.jp) ("This computer has gone to sleep")
- # [09:59] * Joins: zcorpan_ (n=zcorpan@84-216-43-119.sprayadsl.telenor.se)
- # [09:59] * Joins: jgraham (n=jgraham@81-86-222-233.dsl.pipex.com)
- # [10:15] * Joins: the_mart (n=Martin@host86-135-9-158.range86-135.btcentralplus.com)
- # [10:17] * Quits: peepo (n=Jay@86.157.113.34) ("later")
- # [10:21] * Quits: weinig (n=weinig@c-67-188-89-242.hsd1.ca.comcast.net)
- # [10:24] <hsivonen> http://www.w3.org/mid/886507.69879.qm@web50802.mail.re2.yahoo.com
- # [10:26] * Joins: hendry (n=hendry@91.84.62.62)
- # [10:27] <annevk> http://lists.w3.org/Archives/Public/www-validator/2007Jul/0011.html
- # [10:27] <zcorpan_> oh of course. writing your own dtd makes you validate.
- # [10:28] <annevk> it's true
- # [10:28] <annevk> it's just not very smart
- # [10:28] * Quits: kfish (n=conrad@61.194.21.25) ("RW")
- # [10:28] * Joins: billyjack (n=MikeSmit@eM60-254-242-228.pool.emobile.ad.jp)
- # [10:29] <zcorpan_> might be if you really use validation as qa check, and you don't want to flag files that have 1 error you already know about and have to have around
- # [10:30] * Quits: MikeSmith (n=MikeSmit@eM60-254-213-154.pool.emobile.ad.jp) (Read error: 110 (Connection timed out))
- # [10:31] * Joins: webben (i=benh@nat/yahoo/x-c93aa498557bcb6c)
- # [10:42] * Quits: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
- # [10:45] * Joins: ROBOd (n=robod@86.34.246.154)
- # [10:45] * Quits: webben (i=benh@nat/yahoo/x-c93aa498557bcb6c)
- # [10:51] * Joins: webben (i=benh@nat/yahoo/x-7630519bda45a319)
- # [11:04] <Lachy> Hixie, yt?
- # [11:07] <annevk> zcorpan_, http://simon.html5.org/temp/html5lib-tests/dom2string.js doesn't seem to handle attributes
- # [11:08] <zcorpan_> annevk: oops
- # [11:09] * Quits: annevk (n=annevk@pat-tdc.opera.com) (Remote closed the connection)
- # [11:10] * Joins: annevk (n=annevk@pat-tdc.opera.com)
- # [11:13] <zcorpan_> annevk: fixed
- # [11:18] <Hixie> Lachy: yo
- # [11:19] <Lachy> Hey Hixie, Marcos and I are working on the XBL Primer, and we're trying to come up with a concise description of what a template is. Any suggestions?
- # [11:20] <Hixie> it's some markup that will be used to render the bound element, i guess
- # [11:20] <Lachy> so far we have "A template is used to control the presentation of a document", but we want to say something about how it reorders content in the DOM, without altering it, using shadow trees, but without using technical terms
- # [11:20] <annevk> interesting, Opera returns uppercase attribute names
- # [11:21] <zcorpan_> annevk: yeah.
- # [11:21] <Hixie> Lachy: good luck
- # [11:21] <Lachy> thanks
- # [11:21] <Hixie> Lachy: my best attempt is what's in the spec
- # [11:21] <Hixie> Lachy: in the note in the definition of <template>
- # [11:22] <annevk> "A template defines the building blocks for the subtree of the bounding element."
- # [11:22] <Lachy> yeah, that's the problem :-)
- # [11:23] <Lachy> hmm. we could try and work something like that into it.
- # [11:24] <annevk> just say something and then illustrate it with some "easy" to grasp examples
- # [11:24] <Lachy> yeah, that's the idea
- # [11:27] <zcorpan_> hm. opera can have cdata nodes in the dom. how should i output those?
- # [11:27] <zcorpan_> "<![CDATA[ " + current.nodeValue + " ]]>" ?
- # [11:29] <annevk> yeah
- # [11:32] <zcorpan_> done
- # [11:38] <Hixie> i'm instrumenting my html parser to report how many times it clones nodes in the AAA and inline-reconstruction algorithms
- # [11:38] <Hixie> anything else i can instrument while i'm at it?
- # [11:39] <Hixie> hsivonen? annevk? jgraham?
- # [11:40] <annevk> we have some XXX comments about tokenization...
- # [11:41] <annevk> specifically which cases in states are the most frequent
- # [11:41] <annevk> so you can optimize those cases in some way...
- # [11:42] <annevk> other interesting things might be <form> nodes <form> where nodes does not include </form> and then do some browser testing on those more complicated examples from real world pages
- # [11:44] <Hixie> eh?
- # [11:45] <Hixie> i could emit for each tokeniser state the most common tokens seen, i guess
- # [11:46] <Hixie> it would make the parser way slower, but it could work
- # [11:46] <annevk> it's probably not very important
- # [11:46] * Joins: maikmerten (n=maikmert@T6eaf.t.pppool.de)
- # [11:46] <annevk> tree mutation and node duplication are more interesting
- # [11:47] <annevk> would be fun to count how often you encounter <canvas> nowadays :)
- # [11:49] <Hixie> i've looked at elements in a separate study
- # [11:50] <Hixie> canvas didn't appear in the top 200
- # [11:51] * zcorpan_ suspects that some <canvas>es are only output with script
- # [12:00] <annevk> k
- # [12:00] <zcorpan_> hmm. dom core doesn't specify an order for .attributes ... i need to sort them myself
- # [12:01] <annevk> I wonder if we have actually sorted them...
- # [12:03] <zcorpan_> opera and safari don't seem to sort them. ie seems to sort them alphabetically. firefox alphabetically reversed.
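Because DOM Core leaves the order of `.attributes` unspecified, a test serializer has to impose its own order to get comparable output across browsers. The following is a sketch in the spirit of dom2string.js, not a port of it: it uses Python's `xml.dom.minidom`, the output layout is an assumption, and only the CDATA form is taken from zcorpan_'s message above. No escaping is attempted.

```python
from xml.dom import Node, minidom


def dom_to_string(node, depth=0):
    """Serialize a DOM subtree to indented lines with attributes sorted by name."""
    indent = "  " * depth
    if node.nodeType == Node.ELEMENT_NODE:
        # Sorting makes the output independent of the DOM's attribute order.
        attrs = "".join(
            ' %s="%s"' % (name, value)
            for name, value in sorted(node.attributes.items())
        )
        lines = ["%s<%s%s>" % (indent, node.tagName, attrs)]
        for child in node.childNodes:
            lines.extend(dom_to_string(child, depth + 1))
        return lines
    if node.nodeType == Node.CDATA_SECTION_NODE:
        return [indent + "<![CDATA[ " + node.nodeValue + " ]]>"]
    if node.nodeType == Node.TEXT_NODE:
        return ['%s"%s"' % (indent, node.nodeValue)]
    return []  # other node types are ignored in this sketch
```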
- # [12:03] <Hixie> ok i'm going to emit a list of total count of all the tokens
- # [12:04] <Hixie> for each kind of token in each insertion mode
- # [12:04] <Hixie> anything else?
- # [12:04] <Hixie> last chance before i set this off and go to bed...
- # [12:04] <annevk> ah, I actually meant characters I think
- # [12:04] <annevk> but that may be too expensive
- # [12:04] <Hixie> characters?
- # [12:04] <annevk> during tokenization
- # [12:04] <Hixie> how do you mean?
- # [12:05] <zcorpan_> see how often ">" (with quotes) appears in doctypes or bogus comments
- # [12:05] <annevk> so you can optimize a particular tokenization state
- # [12:05] <Hixie> oh i thought you wanted to optimise the tree constructor states
- # [12:06] <Hixie> zcorpan_: hm
- # [12:06] <hsivonen> Hixie: hmm. I guess there might be merit in instrumenting how often IN_BODY code runs with the actual insertion mode being one of the table modes other than caption and cell
- # [12:06] <Hixie> annevk: surely for the tokeniser it makes no difference since you'll just do table dispatch
- # [12:06] <annevk> IE has this nice <!- .... ">" more comment ... >
- # [12:07] <zcorpan_> Hixie: http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2007-June/012078.html
- # [12:07] <Hixie> hsivonen: you mean an average of times per page that the inbody state is invoked when the state is not inbody, incell, or incaption?
- # [12:07] <hsivonen> Hixie: is it even important to clone DOM nodes instead of using the attributes on the original token and creating a new DOM node using those?
- # [12:07] <Hixie> zcorpan_: yeah i'm just trying to work out how to do it
- # [12:07] <hsivonen> that is, do you really want to close concurrent attribute changes?
- # [12:08] <Hixie> i don't think the dom supports having attributes shared between nodes
- # [12:09] <hsivonen> Hixie: yes, the average times the table states actually fall though to in body
- # [12:09] <hsivonen> through
- # [12:12] <Hixie> ok, i'm logging the actual insertion mode when my inhead, inbody, and intable functions are invoked
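The instrumentation described here amounts to counting (handler, actual insertion mode) pairs. A minimal sketch follows; the class name, method names, and mode strings are hypothetical and do not come from Hixie's parser.

```python
from collections import Counter


class ModeInstrumentation:
    """Counts how often each handler runs under each actual insertion mode."""

    def __init__(self):
        self.calls = Counter()

    def record(self, handler, actual_mode):
        """Call at the top of a handler such as the in-body function."""
        self.calls[(handler, actual_mode)] += 1

    def fall_throughs(self, handler, ignore=()):
        """Invocations of `handler` made while the parser was in another mode.

        Modes listed in `ignore` (for example cell and caption) are not counted.
        """
        skipped = {handler, *ignore}
        return sum(
            count
            for (name, mode), count in self.calls.items()
            if name == handler and mode not in skipped
        )
```

Dividing `fall_throughs("in_body", ignore=("in_cell", "in_caption"))` by the number of pages parsed would give the per-page average hsivonen asks about.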
- # [12:12] <hsivonen> Hixie: since that only happens in non-conforming cases and Java doesn't have goto, I let the code hit some useless branches when the fall-through happens
- # [12:12] <Hixie> hopefully they map exactly to the spec
- # [12:14] <Hixie> zcorpan_: for DOCTYPEs we don't care, right? since what the spec does matches IE anyway?
- # [12:14] <hsivonen> (A smart compiler could fix this, but I doubt javac or hotspot are that smart)
- # [12:14] <annevk> yeah, DOCTYPEs match IE
- # [12:14] <annevk> it's just that IE uses the same mode for bogus comments as they use for DOCTYPEs it seems
- # [12:15] <Hixie> i'm gonna bail on working out what characters are most common in each tokeniser mode, on the principle that there are so few states it hardly matters anyway
- # [12:15] <zcorpan_> Hixie: not quite. the spec doesn't handle <!doctype ">" >
- # [12:15] <annevk> oops
- # [12:15] <zcorpan_> Hixie: the spec only matches ie if the > is in an actual FPI or SPI
- # [12:16] <hsivonen> Hixie: oh yeah, one more thing for optimization: whether an average stack node is tested for being in a group of element names more than once
- # [12:17] <Hixie> well i didn't find any DOCTYPEs with > in their name part, at least not enough to appear on my radar in the scan of doctypes i did earlier this week
- # [12:17] <hsivonen> Hixie: that is, whether it makes sense to have a boolean on a stack node that says for example whether the node is a table context sentinel
- # [12:17] <zcorpan_> Hixie: ok
- # [12:17] <zcorpan_> Hixie: isn't that because > in the name part terminates the doctype? :)
- # [12:18] <hsivonen> Hixie: or whether a stack node should have a flag for phrasing OR formatting OR div OR address
- # [12:18] <Hixie> sorry, i meant "
- # [12:18] <zcorpan_> ah
- # [12:18] <zcorpan_> ok
- # [12:18] <Hixie> hsivonen: so what i did with that is that each well-known tag name has an integer associated with it (like an atom) and for each special feature that the parser cares about i used a bit
- # [12:19] <Hixie> i used 24 bits for these flags
- # [12:20] <Hixie> so for example all the <hx> elements have the number 0x400008400000
- # [12:20] <hsivonen> Hixie: my strategy is to intern well-known names so that testing against one name is a comparison of memory addresses but still testing if a name is in a group means as many comparisons as names in the group
- # [12:20] <Hixie> the leading 0x4 is "element" (as opposed to text node), the 8 is "hx node", and the 4 is "closes <p> elements"
- # [12:21] <Hixie> yeah so my parser never compares tag names once they're in the stack
- # [12:21] <Hixie> doing string compares was prohibitively expensive
- # [12:21] <hsivonen> interesting
- # [12:21] <Hixie> i just use the integer that says whether a node is a text node, comment node, doctype, etc, to say what special kind of element it is too
- # [12:22] <Hixie> and so everything is always exactly one & and exactly one ==
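The flag scheme Hixie describes above can be sketched roughly as follows. The bit positions, the `NODE_TYPES` table and the helper names are assumptions made for illustration; the log does not give the actual layout used in his parser.

```python
# Hypothetical bit assignments; the real layout in Hixie's parser is not in the log.
ELEMENT = 1 << 0    # node is an element (as opposed to text, comment, ...)
TEXT = 1 << 1       # node is a text node
HEADING = 1 << 2    # "hx node"
CLOSES_P = 1 << 3   # start tag closes an open <p>

# One integer per well-known tag name, combining every flag the parser cares about.
NODE_TYPES = {
    "h1": ELEMENT | HEADING | CLOSES_P,
    "h2": ELEMENT | HEADING | CLOSES_P,
    "div": ELEMENT | CLOSES_P,
    "span": ELEMENT,
}


def node_type_for(tag_name):
    """Look the tag name up once, when the node is created."""
    # Unknown names (e.g. <fiv>) are plain elements with no special flags.
    return NODE_TYPES.get(tag_name, ELEMENT)


def has_flags(node_type, mask, expected):
    """Exactly one & and exactly one ==, with no string comparison."""
    return (node_type & mask) == expected


heading = node_type_for("h2")
print(has_flags(heading, ELEMENT | HEADING, ELEMENT | HEADING))
print(has_flags(node_type_for("fiv"), CLOSES_P, CLOSES_P))
```

Once a node is on the stack, every later check is an integer test, so the cost of the string lookup is paid only at node creation.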
- # [12:22] * Joins: Ducki (n=Alex@dialin-145-254-186-173.pools.arcor-ip.net)
- # [12:23] <annevk> and you construct those numbers during tokenization?
- # [12:23] <hsivonen> I guess I'll complete the tree builder with my current approach and will leave a tokenizer-assigned bitfield as a later interface-breaking optimization
- # [12:24] <Hixie> annevk: whenever i create a node, i create it with the appropriate constant
- # [12:24] <Hixie> the tokeniser doesn't know about these
- # [12:24] <Hixie> it emits tokens with tag names
- # [12:24] <Hixie> it's only when i create nodes that i use these
- # [12:24] <hsivonen> Hixie: ooh. so "closes p" is not assigned in the tokenizer after all
- # [12:24] <annevk> ok, so the tree construction stage does use string comparison?
- # [12:25] <Hixie> yeah, tokens are string-compared
- # [12:25] <Hixie> but i think my compiler might be atomising them
- # [12:25] <Hixie> so it's not such a big deal
- # [12:27] <hsivonen> I'm currently using the generic String.intern(), but I figured how to make a fast interning function with knowledge about the possible names (three-level switch: length, last char, second to last char)
- # [12:27] <hsivonen> but typing that is too much work
- # [12:27] <hsivonen> so I guess I'll write a small Python program that generates Java code for the interning function at some point
- # [12:28] <Hixie> zcorpan_: given that only IE does this, I'm going to assume it's not a big deal. I can investigate it in more detail later maybe. Don't want to hack the parser too much tonight. :-)
- # [12:28] <Hixie> beware that the names are unbounded
- # [12:28] <Hixie> <fiv> is an element name that is seen in the wild, e.g.
- # [12:28] <Hixie> you don't want to treat it as <div>
- # [12:29] <Hixie> especially in your case :-)
- # [12:30] <hsivonen> Hixie: of if the length is > 2, the prefix needs to be compared, too, to make sure
- # [12:30] <hsivonen> Hixie: still better than an intermediate copy to java.lang.String
- # [12:31] <hsivonen> Hixie: the idea is to weed out all but one prefix candidate
- # [12:31] <Hixie> ah cool
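A minimal sketch of the lookup hsivonen outlines (narrow down by length, last character and second to last character, then compare the whole name so that `<fiv>` is not mistaken for `<div>`). It uses a Python dictionary rather than generated Java switch statements, and the `KNOWN_NAMES` list and function names are hypothetical.

```python
# Hypothetical subset of well-known names; a real parser would list many more.
KNOWN_NAMES = ["a", "b", "p", "br", "hr", "td", "th", "tr",
               "div", "img", "body", "head", "html", "span", "table"]


def _key(name):
    """(length, last char, second to last char); names of length 1 use ''."""
    return (len(name), name[-1], name[-2] if len(name) > 1 else "")


def build_table(names):
    """Map each key to its single candidate, refusing ambiguous name sets."""
    table = {}
    for name in names:
        key = _key(name)
        if key in table:
            raise ValueError("key collision: %r and %r" % (table[key], name))
        table[key] = name
    return table


TABLE = build_table(KNOWN_NAMES)


def intern_name(chars):
    """Return the canonical string for a well-known name, else None."""
    if not chars:
        return None
    candidate = TABLE.get(_key(chars))
    # The key only weeds out all but one candidate; the full comparison is
    # still needed because element names seen in the wild are unbounded.
    if candidate is not None and candidate == chars:
        return candidate
    return None
```

Because `intern_name` always returns the object stored in `TABLE`, two lookups of the same known name can afterwards be compared by identity.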
- # [12:33] * Joins: Ducki_ (n=Alex@dialin-145-254-189-168.pools.arcor-ip.net)
- # [12:36] <Hixie> right sleep time
- # [12:36] <Hixie> nn
- # [12:36] <hsivonen> nn
- # [12:37] * Quits: Ducki (n=Alex@dialin-145-254-186-173.pools.arcor-ip.net) (Read error: 113 (No route to host))
- # [12:41] * Quits: Lachy (n=Lachy@124-168-24-114.dyn.iinet.net.au) (Read error: 104 (Connection reset by peer))
- # [12:56] * Joins: zcorpan (n=zcorpan@84-216-43-119.sprayadsl.telenor.se)
- # [13:03] * Quits: zcorpan_ (n=zcorpan@84-216-43-119.sprayadsl.telenor.se) (Read error: 110 (Connection timed out))
- # [13:20] * Quits: webben (i=benh@nat/yahoo/x-7630519bda45a319)
- # [13:23] * Joins: webben (i=benh@nat/yahoo/x-a060493131c95b1e)
- # [13:26] <zcorpan> the parser test format doesn't distinguish between an "" attribute and a text node "=" (e.g.: <p "">"="</p>)
- # [13:26] <zcorpan> | <p>
- # [13:26] <zcorpan> | ""=""
- # [13:26] <zcorpan> | ""=""
- # [13:26] * Quits: webben (i=benh@nat/yahoo/x-a060493131c95b1e) (Client Quit)
- # [13:27] <annevk> that's not too relevant though
- # [13:27] <annevk> but an interesting edge case
- # [13:28] <zcorpan> perhaps " in text nodes should be escaped with \?
- # [13:28] <annevk> why?
- # [13:29] <zcorpan> so you can tell the difference between attributes and text nodes. but perhaps it doesn't matter
- # [13:30] <annevk> just don't mix them
- # [13:32] <annevk> also, if you make mistakes in your parser at that level you've got bigger issues :)
- # [13:33] <zcorpan> which parser?
- # [13:33] <annevk> HTML parser?
- # [13:33] <zcorpan> ah. yeah.
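The collision zcorpan points out can be reproduced with a small sketch. It assumes the dump format writes an attribute as name="value" and a text node as its data wrapped in double quotes, which is what the two identical lines in the example suggest. The function names are hypothetical, and the escaped variant is only one reading of the backslash suggestion.

```python
def serialize_attribute(name, value):
    """Assumed dump form for an attribute: name="value"."""
    return '%s="%s"' % (name, value)


def serialize_text(data):
    """Assumed dump form for a text node: the data wrapped in double quotes."""
    return '"%s"' % data


def serialize_text_escaped(data):
    """Variant that escapes double quotes in text with a backslash."""
    return '"%s"' % data.replace('"', '\\"')


# <p "">"="</p>: an attribute named "" with an empty value, and a text node "=".
attribute_line = serialize_attribute('""', '')
text_line = serialize_text('"="')
print(attribute_line == text_line)

# With escaping, the text node no longer looks like the attribute.
print(serialize_text_escaped('"="') == attribute_line)
```

A complete escaping scheme would also have to escape the backslash itself; that is left out here to keep the sketch short.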
- # [13:37] * Quits: ROBOd (n=robod@86.34.246.154) ("http://www.robodesign.ro")
- # [13:38] * Joins: mw22 (n=chatzill@h8441169151.dsl.speedlinq.nl)
- # [13:41] * Parts: mw22 (n=chatzill@h8441169151.dsl.speedlinq.nl)
- # [13:42] <Philip`_> hsivonen: I think it might be reasonable to keep the spidering and parsing completely separate, so they could be different languages (depending on what useful tools are available for), just communicating asynchronously through some database (which is probably necessary anyway to support parallelism)
- # [13:44] * Joins: ROBOd (n=robod@86.34.246.154)
- # [13:55] <hsivonen> Philip`_: I've never done wide-scale spidering. however, I would think that sticking stuff in a database in between would slow things significantly compared to the parser reading from the real socked when the spidering happens (possible with e.g. Commons HttpClient)
- # [13:57] <hsivonen> to me, it seems that the obvious way to implement this is to have a number of worker threads that run both the parser and the HTTP client and request URLs and report results to a centralized thread-safe coordination object
- # [13:57] <hsivonen> s/socked/socket/
- # [13:59] <hsivonen> as for tools in different languages, if you can't make everything run on a JVM, communicating through a local socket is more efficient than having a persistence layer in between
- # [13:59] <hsivonen> I am assuming here that we don't want to keep copies of the spidered bytes
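The design hsivonen has in mind (worker threads that each fetch and parse, asking a central thread-safe coordination object for URLs and reporting results to it) could look roughly like this. It is written in Python rather than for the JVM, `fetch_and_parse` is a placeholder that touches no network, and all class and function names are hypothetical.

```python
import threading


class Coordinator:
    """Central, thread-safe object that hands out URLs and collects results."""

    def __init__(self, urls):
        self._lock = threading.Lock()
        self._pending = list(urls)
        self.results = {}

    def next_url(self):
        with self._lock:
            return self._pending.pop() if self._pending else None

    def report(self, url, result):
        with self._lock:
            self.results[url] = result


def fetch_and_parse(url):
    # Placeholder: a real worker would stream the response from the socket
    # straight into the parser, keeping no copy of the spidered bytes.
    return len(url)


def worker(coordinator):
    """Take URLs until none are left, reporting one result per URL."""
    while True:
        url = coordinator.next_url()
        if url is None:
            return
        coordinator.report(url, fetch_and_parse(url))


def run(urls, thread_count=4):
    coordinator = Coordinator(urls)
    threads = [threading.Thread(target=worker, args=(coordinator,))
               for _ in range(thread_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return coordinator.results
```

Spreading the work over several machines would mean replacing the in-process `Coordinator` with something reachable over the network.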
- # [14:00] <Philip`_> It would be useful to allow the thing to run on multiple computers to spread the load out, and then it would need some network communication for coordination instead of just threads
- # [14:01] <hsivonen> Philip`_: it might be worth investigating if instead of running a spider we should run on EC2 and read the latest Alexa spidering dump from S3
- # [14:01] <Philip`_> (I'm kind of thinking about multiple computers on a LAN with a fast internet connection, so the network wouldn't be a bottleneck when spreading stuff out)
- # [14:02] <hsivonen> I poked around the Amazon docs but I didn't find out if the Alexa dump can be easily read by URL instead of by handle obtained from Alexa search results
- # [14:02] <Philip`_> That sounds like a useful thing to investigate
- # [14:03] <hsivonen> Philip`_: anyway, you definitely want to keep the JVM up and running with multiple threads reading from sockets instead of invoking it again and again
- # [14:03] <hsivonen> I don't know where the other end of those sockets should be
- # [14:06] * Quits: othermaciej (n=mjs@dsl081-048-145.sfo1.dsl.speakeasy.net)
- # [14:08] <Philip`_> Perhaps the hardest bit is working out which pages to look at so that the sample is biased sensibly - I assume normal spiders just try to grab as much stuff as possible, which is not useful since they'll spend far too long in a few large sites
- # [14:09] <hsivonen> yeah, I think in principle we want to look at the Web breadth first, but not just front pages
- # [14:09] <Philip`_> and I would expect it's not possible to grab a large enough sample to do something like PageRank to find the interesting pages
- # [14:13] <Philip`_> (though maybe it wouldn't be too rubbish to just use the process which the original PageRank is modelling, where you follow random links and have a ~15% chance of getting bored and jumping to some other arbitrary page)
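The sampling process Philip` mentions (follow random links, with roughly a 15% chance of getting bored and jumping elsewhere) might be sketched as below. The link graph is passed in as a plain dictionary instead of being fetched, and "some other arbitrary page" is approximated by a fixed seed list; both are assumptions, as are the names.

```python
import random


def random_surf(links, seeds, steps, jump_probability=0.15, rng=None):
    """Return the list of pages visited by a random surfer.

    links: dict mapping a page to the list of pages it links to (assumed to
           be known in advance; a real spider would discover it by fetching).
    seeds: pages the surfer may jump to when bored or stuck on a dead end.
    """
    rng = rng or random.Random()
    visited = []
    page = rng.choice(seeds)
    for _ in range(steps):
        visited.append(page)
        outgoing = links.get(page, [])
        if not outgoing or rng.random() < jump_probability:
            page = rng.choice(seeds)      # bored (or dead end): jump
        else:
            page = rng.choice(outgoing)   # follow a random link
    return visited


graph = {
    "a.example/": ["a.example/1", "b.example/"],
    "a.example/1": ["a.example/"],
    "b.example/": [],
}
sample = random_surf(graph, ["a.example/", "b.example/"], steps=20,
                     rng=random.Random(0))
print(len(sample))
```

The random jump limits how long the sample can stay inside one large site, which is the bias Philip` is worried about.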
- # [14:15] * Joins: webben (i=benh@nat/yahoo/x-bd7f5d0228cb47d3)
- # [14:15] <hsivonen> cool. the IA crawler uses Commons HttpClient
- # [14:21] * Quits: webben (i=benh@nat/yahoo/x-bd7f5d0228cb47d3) (Read error: 104 (Connection reset by peer))
- # [14:21] * Joins: webben (i=benh@nat/yahoo/x-726aa07150f97726)
- # [14:26] <hsivonen> Philip`_: I encourage you to take a look at http://crawler.archive.org/
- # [14:33] * Joins: SavageX (n=maikmert@T63c3.t.pppool.de)
- # [14:33] * Joins: Ducki__ (n=Alex@dialin-212-144-064-058.pools.arcor-ip.net)
- # [14:51] * Quits: maikmerten (n=maikmert@T6eaf.t.pppool.de) (Read error: 110 (Connection timed out))
- # [14:53] * Quits: Ducki_ (n=Alex@dialin-145-254-189-168.pools.arcor-ip.net) (Read error: 110 (Connection timed out))
- # [15:26] * Quits: annevk (n=annevk@pat-tdc.opera.com) (Read error: 104 (Connection reset by peer))
- # [15:26] * Joins: annevk (n=annevk@pat-tdc.opera.com)
- # [15:40] * Quits: annevk (n=annevk@pat-tdc.opera.com) (Read error: 104 (Connection reset by peer))
- # [15:41] * Joins: annevk (n=annevk@pat-tdc.opera.com)
- # [15:43] * Quits: hendry (n=hendry@91.84.62.62) (Read error: 113 (No route to host))
- # [15:43] * Joins: hendry (n=hendry@91.84.62.62)
- # [15:44] * Quits: jgraham (n=jgraham@81-86-222-233.dsl.pipex.com) (Read error: 110 (Connection timed out))
- # [15:51] * Joins: jgraham (n=jgraham@81-86-222-233.dsl.pipex.com)
- # [16:05] * Quits: webben (i=benh@nat/yahoo/x-726aa07150f97726)
- # [16:16] * Joins: Lachy (n=Lachy@124-168-24-114.dyn.iinet.net.au)
- # [16:28] * Quits: billyjack (n=MikeSmit@eM60-254-242-228.pool.emobile.ad.jp) (Read error: 110 (Connection timed out))
- # [16:29] * Joins: tndH_ (i=Rob@83.100.252.160)
- # [16:30] * Joins: billyjack (n=MikeSmit@eM60-254-240-50.pool.emobile.ad.jp)
- # [16:33] * Joins: Ducki_ (i=Alex@dialin-145-254-188-006.pools.arcor-ip.net)
- # [16:37] * billyjack is now known as MikeSmith
- # [16:46] * Quits: tndH (i=Rob@adsl-87-102-93-12.karoo.KCOM.COM) (Read error: 110 (Connection timed out))
- # [16:51] * Quits: hendry (n=hendry@91.84.62.62) ("brb")
- # [16:51] * Quits: Ducki__ (n=Alex@dialin-212-144-064-058.pools.arcor-ip.net) (Read error: 113 (No route to host))
- # [16:54] * Joins: hendry (n=hendry@91.84.62.62)
- # [17:27] * Quits: Lachy (n=Lachy@124-168-24-114.dyn.iinet.net.au) (Read error: 110 (Connection timed out))
- # [17:43] * Joins: tantek (n=tantek@adsl-63-195-114-133.dsl.snfc21.pacbell.net)
- # [17:53] * Quits: virtuelv (n=virtuelv@pat-tdc.opera.com) (Read error: 110 (Connection timed out))
- # [18:02] * Joins: virtuelv (n=virtuelv@pat-tdc.opera.com)
- # [18:04] * Joins: Lachy (n=Lachy@124-168-24-114.dyn.iinet.net.au)
- # [18:27] * Joins: hasather (n=hasather@22.80-203-71.nextgentel.com)
- # [18:34] * Joins: Ducki__ (n=Alex@dialin-145-254-189-020.pools.arcor-ip.net)
- # [18:42] * Quits: tantek (n=tantek@adsl-63-195-114-133.dsl.snfc21.pacbell.net)
- # [18:44] * Joins: duryodhan (n=chatzill@221-128-173-162.static.exatt.net)
- # [18:50] * Quits: gsnedders (n=gsnedder@host81-132-88-104.range81-132.btcentralplus.com) (Read error: 104 (Connection reset by peer))
- # [18:51] * Joins: gsnedders (n=gsnedder@host81-132-88-104.range81-132.btcentralplus.com)
- # [18:53] * Quits: Ducki_ (i=Alex@dialin-145-254-188-006.pools.arcor-ip.net) (Read error: 113 (No route to host))
- # [19:11] * Quits: BenWard (i=BenWard@nat/yahoo/x-36d10ff5536839e6) ("Fades out again…")
- # [19:23] * Philip`_ is now known as Philip`
- # [19:26] * Joins: webben (i=benh@nat/yahoo/x-9081a1806ada02c3)
- # [19:26] * Quits: hendry (n=hendry@91.84.62.62) ("vmware")
- # [19:32] * Joins: Codler (n=Codler@84-218-6-152.eurobelladsl.telenor.se)
- # [19:33] * Parts: hasather (n=hasather@22.80-203-71.nextgentel.com)
- # [19:35] * Joins: hasather (n=hasather@22.80-203-71.nextgentel.com)
- # [19:35] <annevk> http://html5.org/parsing-tests/testrunner.htm
- # [19:38] <annevk> lots of browser backing for ignoring </head>
- # [19:39] <annevk> but I guess that was already known
- # [19:40] <annevk> I suppose next would be some prefs so you can ignore IE <title> insertions
- # [19:50] * Joins: hendry (n=hendry@91.84.62.62)
- # [20:04] * Joins: tndH (i=Rob@83.100.252.160)
- # [20:15] * Quits: ROBOd (n=robod@86.34.246.154) ("http://www.robodesign.ro")
- # [20:18] * Quits: tndH_ (i=Rob@83.100.252.160) (Read error: 110 (Connection timed out))
- # [20:20] * Joins: bzed (n=bzed@dslb-084-059-118-233.pools.arcor-ip.net)
- # [20:29] <jgraham> annevk: re: running python on my web server; the short answer is that I can't (that was in response to your message a few days ago)
- # [20:34] * Joins: Ducki_ (n=Alex@dialin-145-254-187-047.pools.arcor-ip.net)
- # [20:42] * Quits: Ducki__ (n=Alex@dialin-145-254-189-020.pools.arcor-ip.net) (Read error: 104 (Connection reset by peer))
- # [20:46] * Quits: gsnedders (n=gsnedder@host81-132-88-104.range81-132.btcentralplus.com) ("Don't touch /dev/null…")
- # [20:48] * Quits: Codler (n=Codler@84-218-6-152.eurobelladsl.telenor.se) (Client Quit)
- # [20:51] <annevk> jgraham, are you a registered user?
- # [20:51] <annevk> Philip`, zcorpan, you can now filter with http://html5.org/parsing-tests/testrunner.htm as well for IE specific quirks
- # [20:54] * annevk wonders what tantek will do next
- # [21:01] * Quits: webben (i=benh@nat/yahoo/x-9081a1806ada02c3) (Read error: 110 (Connection timed out))
- # [21:02] <annevk> Setting the flag makes a lot more pass in IE and Opera. Mostly because IE messes up both DOCTYPE and inserts <title> and because Opera does not include DOCTYPE at all
- # [21:03] <annevk> It also helps some for Firefox which always uppercases the tag name in the DOCTYPE
- # [21:04] <jgraham> annevk: Of freenode? No
- # [21:11] * Quits: SavageX (n=maikmert@T63c3.t.pppool.de) ("Leaving")
- # [21:19] <zcorpan> annevk: nice!
- # [21:25] <annevk> I fixed some further bugs and I'm going home now
- # [21:26] <annevk> I'll commit it tomorrow to one of the open source thingies we have
- # [21:26] <zcorpan> ok
- # [21:26] <annevk> now someone can write python scripts to iterate over those numbers browsers return...
- # [21:36] <Hixie> of the 50 or so sites I found with cycles in the headers="", all but three are government sites
- # [21:38] * Joins: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
- # [21:47] * Joins: weinig (n=weinig@c-67-188-89-242.hsd1.ca.comcast.net)
- # [21:49] * Joins: gsnedders (n=gsnedder@host81-132-88-104.range81-132.btcentralplus.com)
- # [21:50] <mpt> How does that compare with the proportion of government sites without cycles in the headers?
- # [21:50] <mpt> (Not that I'm interested, it's just the basic "compared to what?" question)
- # [21:54] * Joins: zcorpan_ (n=zcorpan@84-216-43-119.sprayadsl.telenor.se)
- # [21:59] * Quits: weinig (n=weinig@c-67-188-89-242.hsd1.ca.comcast.net) (Remote closed the connection)
- # [22:01] * Joins: weinig (n=weinig@c-67-188-89-242.hsd1.ca.comcast.net)
- # [22:01] * Quits: weinig (n=weinig@c-67-188-89-242.hsd1.ca.comcast.net) (Remote closed the connection)
- # [22:02] <Hixie> mpt: the fact that it's 50 basically means it's an insignificant number that have cycles
- # [22:04] * Joins: weinig (n=weinig@c-67-188-89-242.hsd1.ca.comcast.net)
- # [22:06] <mpt> ok
- # [22:07] <Hixie> http://sixstar.cca.gov.tw/community/pages/01_about_people.php?CommID=1231&ID=1
- # [22:07] <Hixie> it's so hard to argue that that is a valid use of headers=""
- # [22:07] <Hixie> sigh
- # [22:08] <Hixie> with my proposed heuristic for the top left cell, if they changed that into an actual table it would actually work fine with implied scope=s
- # [22:11] <hsivonen> Hixie: btw, shouldn't scope be down, up, right, left (not row/column)
- # [22:12] <hsivonen> Hixie: if you have two rows of headers where the upper row applies to the lower row but not vice versa, shouldn't scope be down instead of column?
- # [22:14] <hsivonen> An end tag whose tag name is one of: "p", "br" is weird to have in "in head noscript"
- # [22:17] * Quits: zcorpan (n=zcorpan@84-216-43-119.sprayadsl.telenor.se) (Read error: 110 (Connection timed out))
- # [22:17] <zcorpan_> hsivonen: why?
- # [22:18] <Hixie> hsivonen: the values come from html4
- # [22:18] <hsivonen> zcorpan_: other stray end tags get ignored
- # [22:18] <hsivonen> Hixie: I know that the explicit ones come from there but implicit ones don't have to
- # [22:18] <zcorpan_> hsivonen: not </p> or </br>
- # [22:19] <hsivonen> zcorpan_: yeah. like I said, weird
- # [22:19] <Hixie> hsivonen: there's only one implicit one, "auto", and it has no keyword
- # [22:19] <zcorpan_> hsivonen: not specific to in noscript in head though
- # [22:22] <Hixie> wow, some (very few) of the pages caused the AAA algorithm to create over 1000 clones for one stray end tag
- # [22:24] <hsivonen> Hixie: I hope that doesn't count as a reason to redesign the algorithm
- # [22:24] <Hixie> no, it's expected really
- # [22:24] <hsivonen> Hixie: what Safari does on those pages? what about Firefox or Opera?
- # [22:24] <Hixie> no idea, dunno which pages it is
- # [22:25] <Hixie> 355 billion invocations of the AAA algorithm resulted in zero clones
- # [22:26] <Hixie> 715 thousand invocations resulted in one clone
- # [22:26] <Hixie> er sorry
- # [22:26] <Hixie> 715 million
- # [22:26] <Hixie> 55 million resulted in 2 clones
- # [22:26] <Hixie> 10 million, 3 clones
- # [22:26] <Hixie> 3 million, 4 clones
- # [22:27] <Hixie> 800 thousand, 5 clones
- # [22:27] <Hixie> 460000 6 clones
- # [22:27] <gsnedders> Hixie: 1 billion == 1 million million or 1 thousand million?
- # [22:27] <Hixie> 237000 7 clones
- # [22:27] <Hixie> US billion, thousand million, 1e9
- # [22:28] * Quits: MikeSmith (n=MikeSmit@eM60-254-240-50.pool.emobile.ad.jp) (Read error: 110 (Connection timed out))
- # [22:28] <Hixie> less than 100,000 instances of the AAA algorithm resulted in 11 clones
- # [22:28] <Hixie> i guess i should have gotten the total count
- # [22:28] <hsivonen> Hixie: cool. are you going to post this to public-html?
- # [22:28] <Hixie> to make this a useful number
- # [22:28] <Hixie> in due course
- # [22:29] * Philip` finds that writing the HTML5 tokeniser as an OCaml data structure and then printing C++ from it is perhaps slightly crazy, but doesn't seem entirely infeasible (though I've only got about a quarter of two states implemented so far...)
- # [22:30] <Hixie> wait this can't be right, according to separate data, there were only 900,000,000 invocations of the AAA
- # [22:30] <Hixie> oh, wrong number
- # [22:30] <Hixie> phew
- # [22:34] * Joins: Ducki__ (i=Alex@dialin-212-144-065-230.pools.arcor-ip.net)
- # [22:35] * Quits: tndH (i=Rob@83.100.252.160) (Read error: 110 (Connection timed out))
- # [22:35] * Joins: tndH (i=Rob@83.100.252.160)
- # [22:43] <hsivonen> Hixie: I forgot to ask you this when you asked about instrumentation but did you record data on stack depth?
- # [22:44] <Hixie> yeah but it's biased because my parser bails after 64k elements
- # [22:45] <hsivonen> Hixie: what did you find?
- # [22:45] <Hixie> http://freechal.com/banilaB8 was one of the worst pages
- # [22:45] <Hixie> (that my parser didn't bail on)
- # [22:45] <hsivonen> Hixie: so you use a hard limit as well ;-)
- # [22:46] <Hixie> well i run out of bits to store the pointer in after 64k
- # [22:46] <hsivonen> the pointer?
- # [22:46] <Hixie> i have 64 bits to store the length of the text node, the offset of the text node, the pointer to the parent element, and some bits for e.g. if it's a comment node or a text node
- # [22:47] <Hixie> and the bit that points to the parent element has to also sit alongside the 24 bits i use for the element flags
- # [22:47] <Hixie> anyway
- # [22:48] <Hixie> the 50th percentile of the pages my parser didn't bail on had 16 or fewer nodes in its stack at the biggest point
- # [22:48] <Hixie> 99th percentile had 40 or less
- # [22:48] <Hixie> 100th percentile had 64k
- # [22:48] <hsivonen> Hixie: thanks
- # [22:48] <Hixie> i can get you more later but i really have to go shower
- # [22:49] * hsivonen does new StackNode[64]
- # [22:49] <Hixie> heh
- # [22:55] * Quits: Ducki_ (n=Alex@dialin-145-254-187-047.pools.arcor-ip.net) (Read error: 113 (No route to host))
- # [23:01] <Hixie> incidentally, the reason i used 64k as my limit is that i'm having to balance the number of text nodes with the number of elements
- # [23:01] <Hixie> right now my text nodes are 32k max each
- # [23:01] <Hixie> i could make them 16k each but have 128k elements, but it turns out that, anecdotally, to process any significantly greater number of pages, i'd have to add many many bits
- # [23:01] <Hixie> like 4, or 5
- # [23:02] <Hixie> whereas there are many pages with more than 32k characters at once
- # [23:02] <Hixie> i suspect that the pathological cases with deep stacks are all cases of bad interactions with AAA
- # [23:02] * Quits: weinig (n=weinig@c-67-188-89-242.hsd1.ca.comcast.net)
- # [23:04] * Quits: Ducki__ (i=Alex@dialin-212-144-065-230.pools.arcor-ip.net) (Read error: 113 (No route to host))
- # [23:05] * Philip` wonders why Opera says "XML parsing failed" when loading http://html5.org/parsing-tests/data/tests3.dat
- # [23:06] <Philip`> Oh, how odd, it works when I reload...
- # [23:09] <zcorpan_> Philip`: because it thinks anything loaded through XHR is XML
- # [23:09] <zcorpan_> Philip`: and then remembers that
- # [23:09] <Hixie> bbl
- # [23:11] <Philip`> zcorpan_: Ah, that seems to make as much sense as could be expected
- # [23:14] * Joins: othermaciej (n=mjs@dsl081-048-145.sfo1.dsl.speakeasy.net)
- # [23:16] <hsivonen> do these statements have a significant difference "If the stack of open elements has an element in scope with the same tag name as that of the token, then pop elements from this stack until an element with that tag name has been popped from the stack." and "If the stack of open elements has an element in scope with the same tag name as that of the token, then pop elements from this stack until the stack no longer has an element with the same tag nam
- # [23:17] <Hixie> yes
- # [23:17] <hsivonen> ok
- # [23:17] <Hixie> it differs if the stack has two elements of that name in it
- # [23:17] <Hixie> e.g.
- # [23:17] <Hixie> <div><div>
- # [23:17] <Hixie> however typically the second wording is only used for elements that can't be twice on the stack
- # [23:17] <Hixie> in which case it doesn't matter
- # [23:18] <hsivonen> Hixie: how do you get two nested <p> elements in scope?
- # [23:18] <Hixie> i don't think you can
- # [23:19] * Parts: hasather (n=hasather@22.80-203-71.nextgentel.com)
- # [23:19] <hsivonen> Hixie: ok. thanks. I'll send email. Every time you use a different wording for no good reason, I have to stop and think. :-)
- # [23:20] <Hixie> thinking is good! :-)
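The difference between the two wordings hsivonen quotes shows up with Hixie's `<div><div>` example. In this minimal sketch the stack of open elements is a plain list of tag names and the "in scope" check is left out; the function names are hypothetical.

```python
def pop_until_popped(stack, name):
    """First wording: pop until an element with that name has been popped."""
    while stack:
        if stack.pop() == name:
            break


def pop_until_absent(stack, name):
    """Second wording: pop until the stack no longer has such an element."""
    while name in stack:
        stack.pop()


first = ["html", "body", "div", "div"]
second = list(first)

pop_until_popped(first, "div")
pop_until_absent(second, "div")

print(first)   # ['html', 'body', 'div']
print(second)  # ['html', 'body']
```

When the name can occur only once on the stack, both functions leave the same result, which is why the second wording is harmless where it is used.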
- # [23:21] <Hixie> bbl
- # [23:29] * aroben is now known as aroben|food
- # [23:30] * Quits: aroben|food (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
- # [23:50] * Quits: Charl (n=charlvn@c1-228-9.wblv.isadsl.co.za) ("Leaving")
- # [23:50] * Joins: weinig (i=weinig@nat/apple/x-88c022b759e253c0)
- # [23:53] * Joins: aroben|food (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
- # [23:54] * aroben|food is now known as aroben
- # [23:55] * Quits: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net) (Client Quit)
- # [23:56] * Joins: csarven (n=nevrasc@modemcable081.152-201-24.mc.videotron.ca)
- # [23:59] * Joins: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
- # Session Close: Thu Jul 05 00:00:00 2007