# Session Start: Thu Jul 05 00:00:00 2007
# Session Ident: #whatwg
# [00:10] * Quits: tndH (i=Rob@83.100.252.160) ("ChatZilla 0.9.78.1-rdmsoft [XULRunner 1.8.0.9/2006120508]")
# [00:29] * Joins: MikeSmith (n=MikeSmit@eM60-254-213-126.pool.emobile.ad.jp)
# [00:46] * Quits: hendry (n=hendry@91.84.62.62) ("sleep")
# [00:56] * Joins: tantek (n=tantek@m810f36d0.tmodns.net)
# [01:10] * Quits: tantek (n=tantek@m810f36d0.tmodns.net)
# [01:18] * Quits: duryodhan (n=chatzill@221-128-173-162.static.exatt.net) (Read error: 110 (Connection timed out))
# [01:32] * moeffju is now known as moeffju[ZzZz]
# [01:45] * Joins: tantek (n=tantek@c-24-6-138-86.hsd1.ca.comcast.net)
# [02:02] * Joins: karlUshi (n=karl@dhcp-247-173.mag.keio.ac.jp)
# [02:02] * Quits: tantek (n=tantek@c-24-6-138-86.hsd1.ca.comcast.net)
# [02:02] * Quits: bzed (n=bzed@dslb-084-059-118-233.pools.arcor-ip.net) ("Leaving")
# [02:47] * Parts: zcorpan_ (n=zcorpan@84-216-43-119.sprayadsl.telenor.se)
# [02:47] * Quits: the_mart (n=Martin@host86-135-9-158.range86-135.btcentralplus.com) ("Leaving")
# [02:56] * Joins: kfish (n=conrad@61.194.21.25)
# [02:59] <Philip`> Does http://canvex.lazyilluminati.com/misc/imagedata.html crash Opera 9.5? (I can only test via Opera Mini, which just says "Internal server error", which sounds potentially worrying but not very informative)
# [02:59] * Quits: csarven (n=nevrasc@modemcable081.152-201-24.mc.videotron.ca) (Read error: 110 (Connection timed out))
# [03:01] * Quits: MikeSmith (n=MikeSmit@eM60-254-213-126.pool.emobile.ad.jp) (Read error: 104 (Connection reset by peer))
# [03:02] <othermaciej> does Opera Mini handle events?
# [03:03] <othermaciej> and scripting?
# [03:04] <Philip`> It seems to, as long as you don't use setInterval and don't expect it to wait for distant timeouts
# [03:05] <Philip`> (i.e. it can handle scripting and events and stuff while the page is loading, for some definition of 'loading' that I haven't quite worked out, though then it just sends a static copy to your phone)
# [03:06] * Quits: Lachy (n=Lachy@124-168-24-114.dyn.iinet.net.au) ("ChatZilla 0.9.78.1 [Firefox 2.0.0.4/2007051502]")
# [03:06] * Joins: Lachy (n=Lachy@124-168-24-114.dyn.iinet.net.au)
# [03:07] <othermaciej> so script runs at load time but not afterwards?
# [03:08] <Philip`> Yes (as far as I can tell)
# [03:09] <othermaciej> (I'm playing with the Opera Mini simulator)
# [03:09] <Philip`> (since it basically opens the page in Opera on their servers, then at some point it decides it's got enough and transmits a non-interactive compressed snapshot, I think)
# [03:09] <Philip`> (Me too, since my real phone is far too rubbish :-) )
# [03:11] <Philip`> I got it to run ~100 canvas tests in iframes on a single page, and that (eventually) worked correctly with all the scripting and loading and stuff, but it wouldn't let me correctly press the buttons to submit the test results, so I had to do that via a hard-coded timer :-(
# [03:23] * Joins: MikeSmith (n=MikeSmit@eM60-254-197-237.pool.emobile.ad.jp)
# [03:25] * Quits: othermaciej (n=mjs@dsl081-048-145.sfo1.dsl.speakeasy.net)
# [03:34] * Joins: othermaciej (n=mjs@dsl081-048-145.sfo1.dsl.speakeasy.net)
# [03:40] * Joins: yod (n=ot@dhcp-247-181.mag.keio.ac.jp)
# [03:50] * Quits: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
# [04:04] * othermaciej is now known as om_out
# [04:09] * Quits: kfish (n=conrad@61.194.21.25) ("同志社")
# [04:15] * Quits: MikeSmith (n=MikeSmit@eM60-254-197-237.pool.emobile.ad.jp) (Read error: 110 (Connection timed out))
# [04:21] * Joins: MikeSmith (n=MikeSmit@eM60-254-214-154.pool.emobile.ad.jp)
# [04:22] * Quits: MikeSmith (n=MikeSmit@eM60-254-214-154.pool.emobile.ad.jp) (Read error: 104 (Connection reset by peer))
# [04:26] * Joins: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
# [04:26] * Quits: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net) (Remote closed the connection)
# [04:26] * Joins: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
# [04:26] * Quits: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net) (Remote closed the connection)
# [04:27] * Joins: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
# [04:27] * Quits: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net) (Remote closed the connection)
# [04:35] * Joins: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
# [04:35] * Quits: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net) (Remote closed the connection)
# [04:46] * Joins: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
# [04:48] * Quits: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net) (Client Quit)
# [04:48] * Joins: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
# [05:01] * Quits: mpt (n=mpt@121-72-128-43.dsl.telstraclear.net) ("Leaving")
# [05:28] * Joins: tantek (n=tantek@adsl-63-195-114-133.dsl.snfc21.pacbell.net)
# [05:30] * aroben is now known as aroben|food
# [05:31] * Quits: aroben|food (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
# [05:53] * Joins: MikeSmith (n=MikeSmit@eM60-254-215-244.pool.emobile.ad.jp)
# [05:53] * Quits: tantek (n=tantek@adsl-63-195-114-133.dsl.snfc21.pacbell.net)
# [05:58] * Quits: weinig (i=weinig@nat/apple/x-88c022b759e253c0)
# [06:14] * Joins: mpt (n=mpt@121-72-128-43.dsl.telstraclear.net)
# [06:19] <mpt> "For example, don’t put a 100 x 100 image in a 10 x 10 <image> element." -- unintentionally hilarious iPhone developer docs
# [06:20] * Joins: wild_cfo (n=wild_c_f@ool-44c1bb48.dyn.optonline.net)
# [06:27] * Joins: weinig (n=weinig@c-67-188-89-242.hsd1.ca.comcast.net)
# [06:29] <mpt> Ah, interesting: "ensure that width * height * 4 < 8 MB" ... so apparently this <image> element is for some new kind of file that has widths and heights measured in MBm⁻².
# [06:37] <mpt> But hooray for this: "Don’t use JavaScript movie controls to play video on iPhone. iPhone supplies its own controls."
# [06:54] * Quits: Lachy (n=Lachy@124-168-24-114.dyn.iinet.net.au) (kubrick.freenode.net irc.freenode.net)
# [06:54] * Quits: annevk (n=annevk@pat-tdc.opera.com) (kubrick.freenode.net irc.freenode.net)
# [06:54] * Quits: Philip` (n=philip@zaynar.demon.co.uk) (kubrick.freenode.net irc.freenode.net)
# [07:03] * Joins: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
# [07:08] * Joins: Lachy (n=Lachy@124-168-24-114.dyn.iinet.net.au)
# [07:08] * Joins: annevk (n=annevk@pat-tdc.opera.com)
# [07:08] * Joins: Philip` (n=philip@zaynar.demon.co.uk)
# [07:29] * Joins: duryodhan (n=chatzill@221.128.138.137)
# [07:36] * Quits: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net) (Remote closed the connection)
# [07:36] * Joins: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
# [08:08] * Joins: hendry (n=hendry@91.84.62.62)
# [08:15] * Quits: hendry (n=hendry@91.84.62.62) ("wrongkernel")
# [08:32] * Joins: hendry (n=hendry@91.84.62.62)
# [08:39] * Joins: tantek (n=tantek@adsl-63-195-114-133.dsl.snfc21.pacbell.net)
# [08:56] * Quits: weinig (n=weinig@c-67-188-89-242.hsd1.ca.comcast.net)
# [08:59] <om_out> mpt: width * height * 4 bytes
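A minimal sketch of the arithmetic behind the quoted guideline, reading it as decoded width * height * 4 bytes compared against a byte limit. The constant names and the choice of 8 * 1024 * 1024 for "8 MB" are assumptions made for illustration; the quoted docs do not say which megabyte they mean.

```python
# Assumption: 4 bytes per pixel (one byte each for red, green, blue, alpha).
BYTES_PER_PIXEL = 4
# Assumption: "8 MB" is read as 8 * 1024 * 1024 bytes.
LIMIT_BYTES = 8 * 1024 * 1024


def decoded_size_bytes(width, height):
    """Size of the decoded bitmap in bytes."""
    return width * height * BYTES_PER_PIXEL


def fits_limit(width, height, limit=LIMIT_BYTES):
    """True if the decoded bitmap is strictly below the limit."""
    return decoded_size_bytes(width, height) < limit
```

Under these assumptions a 1448 x 1448 image is just under the limit and a 1449 x 1449 image is just over it.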
# [08:59] * om_out is now known as othermaciej
# [09:03] * Joins: Ducki (n=Alex@dialin-145-254-189-142.pools.arcor-ip.net)
# [09:07] <hsivonen> Hixie: http://www.w3.org/mid/A0F10D3A-A679-4BB1-8844-684FBFDB94F6@iki.fi is there a way for the stack to have td or th in such a position that generating implied end tags could close the scope (except for the EOF case)?
# [09:19] * Joins: tndH (i=Rob@83.100.252.160)
# [09:20] * Joins: webben (n=benh@dip5-fw.corp.ukl.yahoo.com)
# [09:24] <annevk> hehe, iPhone docs promote <image> :)
# [09:24] <hsivonen> annevk: URL?
# [09:24] <annevk> http://developer.apple.com/iphone/designingcontent.html
# [09:25] <annevk> click on "Use Standards and Tried-and-True Design Practices" and then search
# [09:27] <othermaciej> I'll report a bug
# [09:31] <hsivonen> annevk: did you try to optimize redundant steps in tree building at all or did you just follow the spec to letter even if it asked you to traverse the stack more than absolutely necessary?
# [09:32] <annevk> there are some small optimizations
# [09:32] <annevk> but not much
# [09:32] <annevk> doesn't really matter a lot in Python I've the feeling
# [09:33] * Quits: karlUshi (n=karl@dhcp-247-173.mag.keio.ac.jp) ("Where dwelt Ymir, or wherein did he find sustenance?")
# [09:33] <annevk> well, in the beginning we tried to reduce function calls by using dictionaries instead of token objects and such and that worked pretty well
# [09:33] <hsivonen> annevk: what's your take on the ability of "generate end tags" to close the scope?
# [09:33] <annevk> but now with the treebuilder abstraction we gained a lot of function calls again :(
# [09:33] * Quits: yod (n=ot@dhcp-247-181.mag.keio.ac.jp) ("This computer has gone to sleep")
# [09:34] <annevk> http://html5lib.googlecode.com/svn/trunk/python/src/html5lib/treebuilders/_base.py search for "generateImpliedEndTags"
# [09:35] <annevk> although I now see it has some XXX comment that we never hit apparently...
# [09:35] <hsivonen> annevk: I was thinking of doing the exact same thing: just popping
# [09:35] <hsivonen> I guess I have to send another email
# [09:36] <annevk> Hixie recently added a bunch of table elements there
# [09:36] <annevk> I'm not sure what that was about
# [09:37] <hsivonen> annevk: I think that was about EOF
# [09:37] <hsivonen> I am not sure that it is a good idea to put them in that part of the spec
# [09:37] <hsivonen> annevk: does Python turn tail recursion into looping?
# [09:38] <annevk> dunno
# [09:38] * Quits: webben (n=benh@dip5-fw.corp.ukl.yahoo.com)
# [09:38] <annevk> http://html5.org/tools/web-apps-tracker?from=964&to=965
# [09:39] <annevk> is that for <table><tbody><tr><td><p><tbody> or something?
# [09:40] <annevk> doesn't seem like it, that already works
# [09:41] <hsivonen> the only case where I see those mattering is the EOF case
# [09:41] <annevk> example markup?
# [09:42] * annevk reads http://en.wikipedia.org/wiki/Tail_recursion and understands we might be able to optimize stuff a bit
# [09:44] <annevk> hmm, seems only to matter if it calls itself a lot
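On the tail recursion question: CPython does not turn tail calls into loops, so a tail-recursive stack walk still uses one interpreter frame per step and is bounded by the recursion limit. The sketch below shows the two shapes side by side; the function names and the list-of-tag-names stack are hypothetical and are not taken from html5lib.

```python
def pop_until_recursive(stack, name):
    """Pop entries until one called `name` has been popped (tail-recursive)."""
    if not stack:
        return stack
    if stack.pop() == name:
        return stack
    # A tail call, but CPython still pushes a new frame for it.
    return pop_until_recursive(stack, name)


def pop_until_loop(stack, name):
    """Same behaviour written as a loop: constant call depth."""
    while stack:
        if stack.pop() == name:
            break
    return stack
```

Both versions leave the same stack behind; only the loop form is safe for very deep stacks.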
# [09:46] <annevk> hsivonen, I don't see how it matters for EOF either
# [09:46] <annevk> hsivonen, you always get a single error and that can't be avoided, because </table> is never implied
# [09:47] <hsivonen> annevk: good point. will you send email or shall I?
# [09:48] <annevk> you're already going pretty good with your review, you do it ;)
# [09:49] <hsivonen> annevk: ok
# [09:52] * Joins: met_ (n=Hassman@r5bx220.net.upc.cz)
# [09:53] <met_> http://www.bluishcoder.co.nz/2007/07/patch-for-video-element-support-in.html
# [09:55] <Hixie> hsivonen: i don't know (re <td>s)
# [09:56] <hsivonen> Hixie: that doesn't sound good ;-)
# [09:57] <Hixie> the table elements were added because it seemed wrong that they not be on the list
# [09:57] <Hixie> i honestly don't know if they'll ever get hit
# [09:57] <Hixie> i want to say no
# [09:57] <Hixie> but i'm not sure how to prove it
# [09:58] <Hixie> i'll be back in about 12 hours
# [09:58] <Hixie> (and possibly briefly in a few minutes)
# [09:58] <hsivonen> Hixie: I'd prefer to pretend that we proved that they never get hit
# [10:00] <annevk> <tbody> gets ignored outside <table>, inside <table> it is handled explicitly in each table phase
# [10:00] <annevk> I wonder if the same goes for <td> and <tr>
# [10:01] <annevk> I'm pretty sure they never get hit either
# [10:01] <annevk> let's test that with the tests we got...
# [10:02] <hsivonen> annevk: tr, td and th start tags are ignored "in body"
# [10:02] <annevk> indeed
# [10:02] <annevk> if I remove "td", "th", "tr" from our generate implied end tags algorithm nothing goes wrong
# [10:03] <annevk> because the table phases already deal with them
# [10:03] <hsivonen> annevk: the end tags seem to fall under "An end tag token not covered by the previous entries", but that seems wrong
# [10:03] <annevk> only "dd", "dt", "li", "p" are important
# [10:03] <annevk> actually, if I remove "p" nothing fails either...
# [10:03] * annevk ponders
# [10:04] <hsivonen> annevk: removing p seems wrong
# [10:04] <hsivonen> hmm. perhaps the An end tag token not covered by the previous entries
# [10:04] <hsivonen> still does the right thing "in body" for cell ends
# [10:04] <annevk> ah, the problem is that we don't count errors I suppose
# [10:05] <annevk> as removing <li> also "works"
# [10:05] <annevk> they are caught by the alternative algorithm that generates parse errors and therefore still generate the same tree...
# [10:06] <hsivonen> IIRC, in fragment cases some "act as if" consistently produce 0 or 2 errors. I think I may have changed some of those to emit 0 or 1 errors
# [10:17] * Joins: Charl (n=charlvn@c1-228-9.wblv.isadsl.co.za)
# [10:28] <annevk> how does "If the stack of open elements has a p element in scope, then generate implied end tags, except for p elements." even make sense?
# [10:28] <annevk> it says that when you encounter </p>
# [10:29] <annevk> however, you will never generate an implied end tag for <dd>, <dt> or <li> or any of the table cells as they can never be between the <p> that is in scope and the current node
# [10:37] <annevk> innerHTML wouldn't change anything for that either
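A minimal sketch of "generate implied end tags" as it is discussed above, assuming the stack of open elements is a plain list of tag names with the current node last. Only the four names annevk calls important are included; the table-related names are left out, matching his experiment, and this is not a claim about the spec's full list or about html5lib's actual code.

```python
# Names taken from the discussion above ("dd", "dt", "li", "p").
IMPLIED_END_TAG_NAMES = frozenset(["dd", "dt", "li", "p"])


def generate_implied_end_tags(open_elements, exclude=None):
    """Pop the current node while it is one of the implied-end-tag names.

    `exclude` models the "except for p elements" wording: a node with that
    name stops the popping instead of being popped.
    """
    while (open_elements
           and open_elements[-1] in IMPLIED_END_TAG_NAMES
           and open_elements[-1] != exclude):
        open_elements.pop()
    return open_elements
```

With the stack html, body, ul, li, p the plain call pops p and li, while the call with exclude="p" pops nothing, because the p is already the current node.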
# [10:48] <hsivonen> annevk: excellent point
# [10:49] <hsivonen> annevk: I'll email again.
# [10:59] * Joins: Ducki_ (n=Alex@dialin-212-144-055-153.pools.arcor-ip.net)
# [11:04] * Joins: BenWard (i=BenWard@nat/yahoo/x-4b53abbbd5c94177)
# [11:08] <hsivonen> should the list of active formatting elements be implemented as an array or as a linked list?
# [11:09] <hsivonen> is it searched much more often than a node is removed from the middle?
# [11:13] <hsivonen> Hixie: was your stat for "invocations of the AAA" exactly this? (that is, is the answer array?)
# [11:14] <hsivonen> oh that counted cloning nodes
# [11:14] <hsivonen> Hixie: did you count changing the size of the list by deleting stuff in the middle?
# [11:17] * Joins: zcorpan_ (n=zcorpan@84-216-41-39.sprayadsl.telenor.se)
# [11:18] * Quits: Ducki (n=Alex@dialin-145-254-189-142.pools.arcor-ip.net) (Read error: 113 (No route to host))
# [11:20] <hsivonen> annevk: does the algorithm for "in body" "An end tag token not covered by the previous entries" make sense to you?
# [11:20] <hsivonen> step 2.3. makes no sense to me
# [11:22] <annevk> what's 2.3?
# [11:22] <hsivonen> Pop all the nodes from the current node up to node, including node, then stop this algorithm.
# [11:23] <hsivonen> First: Initialise node to be the current node (the bottommost node of the stack).
# [11:23] <hsivonen> ok makes sense
# [11:23] <hsivonen> #
# [11:23] <hsivonen> If node has the same tag name as the end tag token, then:
# [11:23] <hsivonen> #
# [11:23] <hsivonen> Generate implied end tags.
# [11:23] <hsivonen> ok, makes sense
# [11:23] <hsivonen> now Pop all the nodes from the current node up to node, including node, then stop this algorithm.
# [11:23] <annevk> oh, I was looking at the wrong algorithm duh
# [11:24] <hsivonen> how could /node/ not already be popped or be the current node?
# [11:24] <hsivonen> shouldn't that be a simple unconditional pop
# [11:25] * Quits: aroben (n=adamrobe@c-67-160-250-192.hsd1.ca.comcast.net)
# [11:25] <hsivonen> umm. not unconditional but pop if the current node is /node/
# [11:25] <annevk> <foo><bar><baz></foo>
# [11:26] <annevk> would pop <baz> and <bar> and <foo>
# [11:26] * Joins: maikmerten (n=maikmert@T63c3.t.pppool.de)
# [11:26] <hsivonen> annevk: sorry for being dense, but I don't understand what step 2.3. has to do with it
# [11:27] <hsivonen> annevk: isn't step 4. what causes that?
# [11:27] <hsivonen> actually, step 2.1. makes no sense to me, either
# [11:27] <annevk> indeed
# [11:28] <annevk> I wonder how we managed to implement it :)
# [11:29] <hsivonen> time to send mail again
# [11:29] <annevk> we implemented what was mentioned
# [11:30] <annevk> which doesn't make much sense :(
# [11:30] <zcorpan_> can you provide a markup snippet that highlights the difference?
# [11:31] <hsivonen> zcorpan_: the difference?
# [11:31] <annevk> <foo>...</foo> is the only case that 2.1 covers
# [11:31] <annevk> in which case you don't need to generate implied end tags etc.
# [11:31] <annevk> you just need to pop
# [11:31] <zcorpan_> ah
# [11:31] <zcorpan_> indeed
# [11:31] <hsivonen> lunch
# [11:31] <hsivonen> then email
# [11:58] * Quits: virtuelv (n=virtuelv@pat-tdc.opera.com) ("Leaving")
# [12:00] * Joins: virtuelv (n=virtuelv@pat-tdc.opera.com)
# [12:01] <annevk> I think I'm done with public-html for the day
# [12:05] * Joins: ROBOd (n=robod@86.34.246.154)
# [12:06] * Quits: virtuelv (n=virtuelv@pat-tdc.opera.com) ("Leaving")
# [12:07] * Joins: virtuelv (n=virtuelv@pat-tdc.opera.com)
# [12:10] * Quits: virtuelv (n=virtuelv@pat-tdc.opera.com) (Client Quit)
# [12:11] * Joins: virtuelv (n=virtuelv@pat-tdc.opera.com)
# [12:11] <hsivonen> annevk: did you see my email about the catch-all end tag case, though? did it make sense?
# [12:13] * Quits: MikeSmith (n=MikeSmit@eM60-254-215-244.pool.emobile.ad.jp) (Read error: 110 (Connection timed out))
# [12:20] <annevk> yes
# [12:23] <hsivonen> ok. thanks.
# [12:43] <annevk> having said that, I'm not sure the algorithm is correct
# [12:43] <annevk> oh wait
# [12:44] <annevk> hsivonen, it does make sense
# [12:44] * annevk just realized
# [12:44] <annevk> hsivonen, because of step 5
# [12:44] <annevk> hsivonen, and step 4
# [12:44] <annevk> hsivonen, they change "node"
# [12:45] <annevk> so say you have <dialog><dd></dialog>
# [12:45] <annevk> you get to 4
# [12:46] <annevk> node becomes <dialog>
# [12:46] <annevk> </dd> is implied
# [12:46] <annevk> done
# [12:46] <annevk> however, it's questionable whether this is correct given that current UAs don't generate implied end tags in those cases...
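The walk described in this exchange can be sketched as follows, assuming the stack of open elements is a list of tag names with the current node last. Only the steps quoted in the log are modelled (initialise node, match the tag name, generate implied end tags, pop up to node, move node down the stack); the unquoted steps, including any parse-error reporting, are deliberately left out, and the function name is hypothetical. As a simplification, implied end tags are only generated for entries above node.

```python
IMPLIED_END_TAG_NAMES = frozenset(["dd", "dt", "li", "p"])


def handle_other_end_tag(open_elements, name):
    """Return (implied, explicit): names popped by step 2.1 and by step 2.3."""
    implied, explicit = [], []
    index = len(open_elements) - 1           # step 1: node = current node
    while index >= 0:
        if open_elements[index] == name:     # step 2: tag names match
            # Step 2.1: generate implied end tags (entries above node only).
            while (len(open_elements) - 1 > index
                   and open_elements[-1] in IMPLIED_END_TAG_NAMES):
                implied.append(open_elements.pop())
            # Step 2.3: pop from the current node up to and including node.
            while len(open_elements) > index:
                explicit.append(open_elements.pop())
            return implied, explicit
        index -= 1                           # steps 4 and 5: previous entry, retry
    return implied, explicit                 # no match: token is ignored here
```

For <foo><bar><baz></foo> nothing is implied and baz, bar, foo are popped by step 2.3. For <dialog><dd></dialog> the dd is closed as an implied end tag and only dialog is left for step 2.3, which is the case where node has stopped being the current node by the time step 2 matches.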
# [12:50] <hsivonen> annevk: well, this certainly looks like something that needs another look by Hixie
# [12:53] <annevk> it seems that for <foo> </foo> it doesn't make much sense
# [12:54] <annevk> well, it seems that you can optimize for <foo> </foo>
# [12:54] <annevk> it does make sense in a twisted way
# [12:56] <hsivonen> annevk: looks like you aren't done for the day after all :-/
# [12:58] * Quits: hendry (n=hendry@91.84.62.62) ("leaving")
# [12:59] * Quits: Ducki_ (n=Alex@dialin-212-144-055-153.pools.arcor-ip.net) (Read error: 104 (Connection reset by peer))
# [12:59] * Joins: Ducki_ (n=Alex@dialin-145-254-186-023.pools.arcor-ip.net)
# [13:12] <hsivonen> I'd like to try to avoid ad hominems, but I'm intrigued that the insistence on a small improvement with great cost comes from an economist
# [13:21] * Joins: MikeSmith (n=MikeSmit@eM60-254-212-208.pool.emobile.ad.jp)
# [13:25] <annevk> that discussion is just painful
# [13:35] <zcorpan_> authors provide fallback to <object>?
# [13:36] * zcorpan_ won't join that discussion
# [13:38] * moeffju[ZzZz] is now known as moeffju
# [13:41] <annevk> hsivonen, yeah :-/
# [13:41] <annevk> these people should join some browser development project and learn about the web a little bit
# [13:54] <zcorpan_> annevk: did you check in the parser-tests thing somewhere?
# [13:54] * Quits: virtuelv (n=virtuelv@pat-tdc.opera.com) (Read error: 104 (Connection reset by peer))
# [13:54] * Joins: yod (n=ot@softbank221018155222.bbtec.net)
# [13:55] * Quits: yod (n=ot@softbank221018155222.bbtec.net) (Remote closed the connection)
# [13:55] * Joins: virtuelv (n=virtuelv@pat-tdc.opera.com)
# [13:55] * Joins: yod (n=ot@softbank221018155222.bbtec.net)
# [13:55] * Quits: yod (n=ot@softbank221018155222.bbtec.net) (Remote closed the connection)
# [13:56] <annevk> not yet
# [13:56] * Joins: yod (n=ot@softbank221018155222.bbtec.net)
# [13:56] * annevk was fixing html5lib
# [13:56] <zcorpan_> ok
# [13:56] <annevk> you want it checked in somewhere?
# [13:57] <zcorpan_> would be nice, in case i feel like improving it
# [13:58] <zcorpan_> no rush though
# [14:01] * Joins: karlUshi (n=karl@124-144-94-188.rev.home.ne.jp)
# [14:02] <annevk> it's in the html5 project now
# [14:02] <annevk> including a README that says to modify the tests from html5lib, not the ones included
# [14:02] <annevk> karlUshi, seen http://html5.org/parsing-tests/testrunner.htm already?
# [14:02] <annevk> karlUshi, you might like it
# [14:03] * Philip` wonders if anyone really cares what input like &#4294967366; gets parsed into
# [14:04] <annevk> FFFD
# [14:04] <annevk> U+FFFD
# [14:04] <Philip`> Is it worth having tests for that kind of thing? (Or are there ones already?)
# [14:04] <Philip`> (Firefox gets it wrong and says "F")
# [14:05] <Lachy> I wonder why it does that
# [14:05] <annevk> maybe a limit
# [14:05] <Philip`> (and so does my non-serious not-really-implemented tokeniser)
# [14:05] <annevk> we have tokenizer tests
# [14:06] <Philip`> Probably by doing "int n; ... n = n*10 + (next_char - '0')" or something and not caring about overflow
# [14:06] <Lachy> looks like it's a limit of 1 0000 0000 base 16
# [14:06] <annevk> Opera and IE get it right
# [14:07] <Philip`> FF also parses &#4294967295; into #4294967295;
# [14:08] <annevk> oops
# [14:08] * Philip` doesn't expect this is a likely place for real-world interoperability concerns
# [14:08] <annevk> I suppose that explains how much time reverse engineering costs and that it isn't really worth checking what other browsers do all the time
# [14:09] <hsivonen> if there's anything long about longdesc, it is the email threads
# [14:09] <annevk> :p
# [14:10] <hsivonen> Philip`: that's why you should have an integer overflow guard in your loop that consumes NCRs
# [14:10] * hsivonen has one
# [14:10] <Philip`> I just have a TODO comment stuck in there :-)
# [14:10] <Philip`> and I have another similar comment telling me to implement the non-numeric entity things too
# [14:10] <hsivonen> Philip`: which programming language?
# [14:11] <hsivonen> Philip`: Ocaml?
# [14:11] <Philip`> but I'm not particularly interested in making things actually work at the moment
# [14:11] <Philip`> OCaml generating C++
# [14:11] <hsivonen> cool
# [14:11] <Philip`> (Also OCaml generating .dot files so I can make nice graphs of the tokeniser state transitions)
# [14:11] <annevk> we solved it by having a try statement around the string to int conversion
# [14:12] <hsivonen> if (value < 0) {
# [14:12] <hsivonen> value = 0x110000; // Value above Unicode range but within int
# [14:12] <hsivonen> // range
# [14:12] <hsivonen> }
# [14:13] * Philip` just wants to see what's possible when you have the tokeniser algorithm as a data structure that you can process, instead of being English text or unprocessable program code
# [14:13] * Quits: MikeSmith (n=MikeSmit@eM60-254-212-208.pool.emobile.ad.jp) (Read error: 104 (Connection reset by peer))
# [14:13] <hsivonen> (value is signed)
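A sketch of the overflow being discussed, simulating an unchecked 32-bit accumulator in Python by masking. It is consistent with Philip`'s guess about the "F" result (4294967366 wraps to 70, the code point of "F"), but it is not a claim about what Firefox actually does. The guarded version clamps instead of testing for a negative value, since Python integers do not wrap; all function names are hypothetical, and other invalid references (zero, surrogates) are out of scope.

```python
REPLACEMENT_CHARACTER = 0xFFFD
MAX_CODE_POINT = 0x10FFFF


def parse_decimal_wrapping(digits):
    """Accumulate like an unchecked 32-bit unsigned integer."""
    n = 0
    for ch in digits:
        n = (n * 10 + (ord(ch) - ord("0"))) & 0xFFFFFFFF
    return n


def parse_decimal_guarded(digits):
    """Accumulate, but pin the value just above the Unicode range."""
    n = 0
    for ch in digits:
        n = n * 10 + (ord(ch) - ord("0"))
        if n > MAX_CODE_POINT:
            n = MAX_CODE_POINT + 1  # out of range, and it can never wrap
    return n


def code_point_for(digits):
    """Map a decimal character reference to a code point or U+FFFD."""
    n = parse_decimal_guarded(digits)
    return n if n <= MAX_CODE_POINT else REPLACEMENT_CHARACTER
```

Here parse_decimal_wrapping("4294967366") yields 70, while code_point_for("4294967366") yields 0xFFFD.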
# [14:18] <annevk> Philip`, will you consider implementing all the other fancy stuff as well?
# [14:18] <annevk> or just tokenizing?
# [14:22] <Philip`> That depends on how impossible the rest of it looks :-)
# [14:23] <annevk> by the time Hixie addresses hsivonen's comments nobody will have to think about it anymore :p
# [14:23] <Philip`> The tokeniser is fairly straightforward, since you can just represent the whole thing as a dozen state variables and some functions that match certain states and have transitions into new states
# [14:23] <annevk> now I think of it, that might make it too boring for some!
# [14:24] <Philip`> (The tree construction looks more complex than that, though I haven't looked at it in any detail)
# [14:24] <annevk> tree construction is actually similar
# [14:24] <annevk> although currently it has this concept called insertion mode which makes it look more complicated
# [14:24] <annevk> you can actually implement it as a bunch of states as well
# [14:25] <annevk> the difference being that you have some other set of variables and pass tokens around instead of characters
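A small sketch of the "algorithm as a data structure" idea: transitions stored as plain tuples that can be both executed and rendered as a Graphviz .dot graph. The table is a hypothetical, heavily abridged illustration; apart from the data, bogus comment and markup declaration open states named in the log, the state names and input classes are assumptions and do not reproduce the spec's full tokeniser.

```python
# (source state, input class, target state); abridged and illustrative only.
TRANSITIONS = [
    ("data", "<", "tag open"),
    ("tag open", "!", "markup declaration open"),
    ("tag open", "?", "bogus comment"),
    ("tag open", "letter", "tag name"),
    ("tag name", ">", "data"),
    ("bogus comment", ">", "data"),
]


def next_state(state, symbol, transitions=TRANSITIONS):
    """Look up the target state; assumption: stay put if nothing matches."""
    for source, label, target in transitions:
        if source == state and label == symbol:
            return target
    return state


def to_dot(transitions=TRANSITIONS):
    """Render the same table as Graphviz DOT text."""
    lines = ["digraph tokeniser {"]
    for source, label, target in transitions:
        lines.append('  "%s" -> "%s" [label="%s"];' % (source, target, label))
    lines.append("}")
    return "\n".join(lines)
```

Because the table is data, the same list drives both the lookup and the diagram, which is the property Philip` describes wanting.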
# [14:26] <Philip`> Would I be right in thinking the only way the content model flag can change outside the tokeniser is when explicitly emitting a start tag?
# [14:27] <annevk> yeah
# [14:27] <annevk> hsivonen, removing "td", "th" and "tr" from generate implied end tags does indeed not give any parse error differences
# [14:28] <annevk> hsivonen, removing "p", however, gives 45
# [14:29] <hsivonen> Philip`: it's just that start tags "in body" have a lot of stuff to type
# [14:30] * annevk is amazed at Robert's ability to not understand
# [14:36] * Joins: MikeSmith (n=MikeSmit@eM60-254-202-189.pool.emobile.ad.jp)
# [14:36] * Philip` reaches the bogus comment state, and finds that it totally doesn't match his way of writing the algorithm
# [14:37] <annevk> markup open declaration did?
# [14:38] <annevk> you should be able to implement those as functions I guess; separate from the states
# [14:38] <Philip`> The problem is that it sounds like it needs to look backwards and know what happened before that state was reached
# [14:39] <Philip`> The markup declaration open state is just after the bogus comment state, so I haven't got that far yet :-)
# [14:41] <annevk> don't you have a character queue or something?
# [14:42] <annevk> then you just make sure the right chars are on the stack before switching to the state
# [14:44] <hsivonen> Philip`: you may find my impl useful to look at
# [14:49] <annevk> zcorpan_, in case you missed it: http://html5.googlecode.com/svn/trunk/parser-tests/
# [14:51] * Quits: maikmerten (n=maikmert@T63c3.t.pppool.de) (Read error: 110 (Connection timed out))
# [14:51] * Joins: maikmerten (n=maikmert@T72ea.t.pppool.de)
# [14:52] <zcorpan_> annevk: saw it, cheers
# [14:53] <Philip`> Oh, I think my confusion comes from e.g. "<?" transitioning to the bogus comment state after consuming the '?', whereas "<!x" transitions before consuming the 'x', and the BCS can't tell the difference
# [14:54] <annevk> doesn't it say "unconsume" somewhere?
# [14:56] <Philip`> Not that I can see
# [14:56] <Philip`> but I can work around it by just moving the consumption around to the right places
# [14:58] <hsivonen> Philip`: I think Hixie cut corners when writing the spec. I had a bug there that the unit tests revealed
# [14:58] <hsivonen> Philip`: basically, you need to start filling the bogus comment buffer before you make the actual state transition
# [14:59] * Joins: Ducki__ (n=Alex@dialin-145-254-180-253.pools.arcor-ip.net)
# [14:59] * Quits: karlUshi (n=karl@124-144-94-188.rev.home.ne.jp) ("Where dwelt Ymir, or wherein did he find sustenance?")
# [14:59] * Joins: Codler (n=Codler@84-218-7-44.eurobelladsl.telenor.se)
# [15:01] <Philip`> "(If the comment was started by the end of the file (EOF), the token is empty.)" - isn't it also empty if the comment was started by a > character?
# [15:02] <Philip`> Hmm, I'll wait until later to sort out the details and make it actually work properly and pass the tests :-)
# [15:02] <Philip`> (since the current implementation is totally not executable, which makes it hard to test)
# [15:03] * Quits: BenWard (i=BenWard@nat/yahoo/x-4b53abbbd5c94177) (Read error: 104 (Connection reset by peer))
# [15:03] * Quits: yod (n=ot@softbank221018155222.bbtec.net) ("Leaving")
# [15:04] * Joins: BenWard (i=BenWard@nat/yahoo/x-424721520e41d982)
# [15:05] * Quits: BenWard (i=BenWard@nat/yahoo/x-424721520e41d982) (Read error: 104 (Connection reset by peer))
# [15:06] * Joins: BenWard (i=BenWard@nat/yahoo/x-851c38bdf86ef319)
# [15:07] * Quits: Lachy (n=Lachy@124-168-24-114.dyn.iinet.net.au) (Read error: 110 (Connection timed out))
# [15:12] <annevk> Philip`, yeah, then it's also empty
# [15:16] * Quits: Ducki_ (n=Alex@dialin-145-254-186-023.pools.arcor-ip.net) (Read error: 113 (No route to host))
# [15:21] * Joins: Lachy (n=Lachy@124-168-24-114.dyn.iinet.net.au)
# [15:29] <Philip`> http://canvex.lazyilluminati.com/misc/states.png - incomplete and quite possibly with bugs, but it looks kind of interesting
# [15:34] * Philip` should probably skip all the EOF bits since they're not very interesting and they make the diagram too complex
# [15:36] <Lachy> in the whole fallback content thread, has anyone actually given a use case for needing fallback beyond plain text? All I've seen are unsupported claims that it's needed.
# [15:37] <hsivonen> Philip`: cool. the diagram makes the transitions look more complex than they actually are
# [15:38] <hsivonen> Philip`: in fact there are only two transitions that break a stack assumption
# [15:39] <Philip`> hsivonen: Is that two when not counting all the reconsume-EOF-in-the-data-state ones?
# [15:39] <hsivonen> Lachy: if you want to get rid of longdesc and move the essay about the Union Jack or the dress of Lord Cornwallis inline
# [15:40] * Quits: Toolskyn (i=toolskyn@amy.bdick.de) (Remote closed the connection)
# [15:40] <hsivonen> Philip`: reconsume whatever in data state works as a stack transition
# [15:40] * Joins: Toolskyn (i=toolskyn@amy.bdick.de)
# [15:40] <hsivonen> (see my code :-)
# [15:40] <hsivonen> Philip`: just rewind the stack to the data state
# [15:40] * Philip` will try to finish these bits while still untainted, and then look at the code ;-)
# [15:41] <Lachy> hsivonen: that union jack example isn't particularly significant, since that description is completely inappropriate for how the flag was used.
# [15:41] <Philip`> (I'm not trying to do a practical implementation - mostly I just want pretty pictures and things)
# [15:41] <hsivonen> html5lib and my code are under the MIT license, it's not like looking at AT&T code :-)
# [15:42] <Philip`> I currently just want to represent the algorithm as described in the spec, disregarding the implementation details that everyone else worries about :-)
# [15:47] <MikeSmith> No commit-watchers mail since 28 June ... have there really been no changes, or is the list broken?
# [15:47] <hsivonen> MikeSmith: Hixie is doing research. no changes
# [15:47] <MikeSmith> OK
# [15:47] <MikeSmith> thanks
# [15:50] * Joins: rubys (n=rubys@cpe-075-182-064-252.nc.res.rr.com)
# [15:50] <rubys> annevk: you there?
# [15:52] <rubys> if you get a chance, can you look into removing from tests/test_parser.py the following line "if testName == "tests5": continue # TODO"?
# [15:53] * Quits: BenWard (i=BenWard@nat/yahoo/x-851c38bdf86ef319)
# [15:53] <hsivonen> ouch. the catch all end tag case "in body" has a set of 69 strings to test against...
# [15:55] <hsivonen> perhaps the tokens should come with a clever bitfield after all... instead of just interning
# [15:55] * Joins: BenWard (i=BenWard@nat/yahoo/x-86d43e2f7c62c229)
# [15:57] <hsivonen> or a lex sorted array with binary search. or something...
# [16:02] <Philip`> Does Java let you do binary searches for (interned) strings based on something like a pointer, rather than slowly comparing characters?
# [16:03] <Philip`> (I guess that might not be possible since the GC can move things around arbitrarily and won't maintain a consistent ordering, perhaps)
# [16:05] <hsivonen> Philip`: no, you only get to compare memory addresses for equality
# [16:06] <hsivonen> Philip`: however, I could have a hashtable that knew that all values are interned
# [16:07] <hsivonen> for the time being, I'm treating anything that goes beyond interning name and doing "foo" == name || "bar" == name || ... as a premature optimization
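The alternatives mentioned here (a chain of equality tests, a sorted array with binary search, a hash-based lookup) can be compared in a short sketch. The name list is a placeholder subset chosen for illustration, not the 69 names from the spec, and the function names are hypothetical.

```python
import bisect

# Placeholder subset; the real list discussed above has 69 entries.
NAMES_SORTED = sorted(["address", "blockquote", "div", "p", "table", "td", "th", "tr"])
NAMES_SET = frozenset(NAMES_SORTED)


def in_chain(name):
    """Equality chain, in the spirit of "foo" == name || "bar" == name."""
    return name == "address" or name == "blockquote" or name == "div" \
        or name == "p" or name == "table" or name == "td" \
        or name == "th" or name == "tr"


def in_sorted(name, names=NAMES_SORTED):
    """Binary search over a lexically sorted array."""
    i = bisect.bisect_left(names, name)
    return i < len(names) and names[i] == name


def in_hashed(name, names=NAMES_SET):
    """Hash-based membership test."""
    return name in names
```

All three give the same answers; they differ only in how the cost grows with the number of names, which is why treating the choice as a later optimization is reasonable.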
# [16:08] * Quits: Lfe (n=lfe@bergstroem.nu) ("leaving")
# [16:10] * Philip` wishes OCaml had better error reports than simply "Syntax error"
# [16:20] <Philip`> Oh, assuming there's never an EOF doesn't make the state transitions much simpler - there's only about three cases I can see where it makes a difference
# [16:31] * Joins: billmason (n=billmaso@ip156.unival.com)
# [16:31] <MikeSmith> hsivonen - is it true that currently with html5lib, given an arbitrary HTML document as source that it can construct a DOM from successfully, that DOM can't necessarily be re-serialized as well-formed XML?
# [16:31] <MikeSmith> Or anybody?
# [16:32] <rubys> it is rare, but true
# [16:32] <MikeSmith> (I realize html5lib is not hsivonen's implementation...)
# [16:32] <MikeSmith> rubys - OK
# [16:33] <rubys> it is possible to have entity or attribute names that aren't simple names, it is possible for comments to have two consecutive dashes in them, it is possible for strings to contain form feeds or other values that are illegal in XML.
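[Editorial sketch, not part of the original conversation.] The three failure modes rubys lists can be checked mechanically before attempting an XML serialization. The function below is a hypothetical helper, not an html5lib API, and its name pattern is a simplified ASCII-only approximation of the XML 1.0 `Name` production.

```python
import re

# Simplified, ASCII-only approximation of an XML 1.0 Name (assumption).
_XML_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*")
# Control characters XML 1.0 forbids: below U+0020 except tab, LF and CR.
_ILLEGAL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def xml_problems(attr_names, comments, text_runs):
    """Return a list of reasons the given pieces cannot be written as XML 1.0."""
    problems = []
    for name in attr_names:
        if not _XML_NAME.fullmatch(name):
            problems.append("bad attribute name: %r" % name)
    for comment in comments:
        # XML comments may not contain "--" and may not end with "-".
        if "--" in comment or comment.endswith("-"):
            problems.append("comment contains '--' or ends with '-'")
    for text in text_runs:
        if _ILLEGAL_CHARS.search(text):
            problems.append("text contains a character illegal in XML 1.0")
    return problems
```

For example, an attribute name containing a quote character, a comment containing two consecutive dashes, and a text run containing a form feed (`\x0c`) would each produce one entry in the returned list.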
# [16:33] <MikeSmith> ah
# [16:34] <Philip`> When I tried serialising a random collection of web pages as XML, a significant number (uh, I can't remember how much, but maybe 20% or so) became ill-formed XML
# [16:34] <rubys> other things (like matching up open and close tags) are taken care of by html5lib, and so are the overwhelming majority of common errors.
# [16:34] * Joins: tndH_ (i=Rob@83.100.252.160)
# [16:34] <rubys> 20% surprises me.
# [16:34] <rubys> are these public pages? Can you share an example?
# [16:36] <MikeSmith> but hsivonen's implementation (backend of his conformance checker), by its nature, is inherently capable of producing well-formed XML?
# [16:36] <MikeSmith> is that true?
# [16:36] <MikeSmith> I would think it'd need to be since he has XML tools in the toolchain for it
# [16:37] <MikeSmith> or maybe not
# [16:37] <Philip`> I never looked at the examples in any detail, so I'm not sure what the issues were, though I remember a few were just because of
# [16:37] <Philip`> http://www.toyota.com/ is an interesting one
# [16:37] <Philip`> since it has <spacer type"block" width="1" height="1"></spacer> which gets parsed as an attribute with a " in its name
# [16:38] <Philip`> http://krijnhoetmer.nl/irc-logs/whatwg/20070507#l-581 - hmm, apparently it was 25%
# [16:38] <Philip`> (just using the top thousand Yahoo search results for some boring word, if I remember correctly)
# [16:39] <rubys> html5lib has a sanitizer that removes unsafe or unknown markup. Our goal is to make that bullet proof.
# [16:39] <Philip`> I don't know how many of those issues were just caused by the html5lib toxml() being not very good
# [16:40] <Philip`> (Also I think some of the issues might have been that I didn't handle character encoding properly)
# [16:42] <rubys> If you are interested in producing XML, I would recommend the dom treebuilder
# [16:46] <Philip`> When I was looking at those things before, I was mostly interested in analysing real HTML documents and just avoiding the slowness of repeatedly parsing with html5lib by caching them in a nicer serialised format, but it seems XML isn't very suitable for that :-(
# [16:46] * Quits: wild_cfo (n=wild_c_f@ool-44c1bb48.dyn.optonline.net) ("This computer has gone to sleep")
# [16:46] * Quits: tndH (i=Rob@83.100.252.160) (Read error: 110 (Connection timed out))
# [16:46] <rubys> what type of analysis?
# [16:48] <Philip`> Mainly looking for common usage of certain elements/attributes, like in http://canvex.lazyilluminati.com/misc/copyright.html and http://canvex.lazyilluminati.com/misc/summary.html
# [16:48] <rubys> your requirements are terribly unique, and I would like to work towards making a bullet proof conversion (possibly lossy in cases like spaces in attribute names) possible, and would appreciate test cases towards that end.
# [16:48] <Philip`> (and theoretically any other statistics on HTML documents, except I got distracted before getting around to scaling the system up to work on a reasonable sample)
# [16:49] <Philip`> ((for quite small values of 'reasonable'))
# [16:51] * Joins: hendry (n=hendry@kitten-x.com)
# [16:53] <annevk> his requirements are very relevant for the work the HTML WG and WHATWG are doing (fwiw)
# [16:53] <annevk> although they should be met by having a fast html5lib
# [16:53] <Philip`> I expect I'll get back to this analysis thing at some point, and I'll see if I can extract the cases that cause problems (since I expect it would be nice to be able to use standard XML tools on random documents safely, without having to stick an HTML frontend onto them)
# [16:54] <rubys> a fast html5lib ... which ultimately means a port to C
# [16:54] <rubys> annevk: can you scroll back and see my question about tests5?
# [16:54] <annevk> yeah, saw that
# [16:55] <annevk> thought they already worked
# [16:55] * annevk poners
# [16:55] * annevk ponders*
# [16:55] <rubys> that test passes, except for error checks, which you just enabled.
# [16:55] <rubys> no error is produced on EOF
# [16:55] <Philip`> I'm trying to write the easy part of the parsing algorithm in a language-agnostic manner, so it'll be nice if that works out :-)
# [16:57] <annevk> there should be no error either
# [16:57] <annevk> seems like a simple mistake in the test
# [16:59] * Joins: Ducki (i=Alex@dialin-145-254-186-124.pools.arcor-ip.net)
# [16:59] <rubys> if the tests were changed, then 'next if test_name == "tests5" # TODO' can be removed from ruby/tests/test_parser.rb too
# [17:00] <annevk> yeah, did all that a few minutes ago
# [17:01] <rubys> 'all that'? You changed the ruby test?
# [17:01] <annevk> oh, ruby
# [17:01] * Quits: Ducki__ (n=Alex@dialin-145-254-180-253.pools.arcor-ip.net) (Read error: 113 (No route to host))
# [17:01] <annevk> sorry
# [17:01] <annevk> I haven't played with ruby at all
# [17:03] <rubys> I'd work on a C port, but only if we had more people who were interested in maintaining the code. This business of multiple people making changes to the Python code and Sam porting the changes won't scale much further.
# [17:05] <annevk> if we have a C version we can just make Python and Ruby bindings, no?
# [17:06] <rubys> that could certainly be done
# [17:06] <Philip`> It's nice to have pure Python/Ruby/etc versions when people are unable/unwilling to compile and install C modules
# [17:07] <annevk> can't you make some .pyc version people can just use?
# [17:07] * annevk isn't really up to speed with C > Python mappings and how to work with them
# [17:08] <Philip`> (hence things like XML::Sax::PurePerl)
# [17:10] <Philip`> I think you probably need a .dll (or .so or whatever) if you want to use a C library in Python, and that will be specific to a certain processor architecture and OS and maybe other system libraries, which is a pain when people can't compile easily
# [17:10] <annevk> hmm, fair enough
# [17:11] <rubys> on the other hand, 99.99% of the people would choose to use a C binding to their favorite language over a native binding.
# [17:12] <annevk> http://lists.w3.org/Archives/Public/www-archive/2007Jul/0010.html ...
# [17:13] * tndH_ is now known as tndH
# [17:13] <annevk> rubys, people who care one bit about performance, indeed
# [17:14] <annevk> also, C bindings to an HTML5 parser should just be included by default in Python, Ruby, Java, etc.
# [17:14] <annevk> well, maybe not Java
# [17:15] <Philip`> Perl too :-)
# [17:15] <rubys> I'd also love to see the C parser actually used by products like Opera and/or Firefox.
# [17:15] <rubys> they could have their own treebuilders, of course; but the parser could be the same.
# [17:17] * Philip` wishes he could remember how to compute transitive closures (in a functional language)
# [17:18] <annevk> from what I heard from WebKit and Firefox architecture that might be quite tricky
# [17:19] <rubys> I'm not familiar with WebKit, but I have taken a peek at Firefox. Don't see why it would be tricky (I know, I know, famous last words...)
# [17:20] * annevk needs /ignore for e-mail clients
# [17:21] <annevk> rubys, maybe it's possible, they have done it for the XML parser after all...
# [17:23] <rubys> exactly... there is a part in the logic where you take in an input stream and produce a custom DOM implementation. Obviously, the input stream and DOM may vary from product to product, as would the tokenizer/parser error handling, but the logic could be pluggable.
# [17:24] <rubys> Imagine how nice it would be if Safari, Firefox, and Opera used the SAME tokenizer/parser?
# [17:24] <annevk> hmm, no parsing bugs to exploit!
# [17:24] <Philip`> They'd probably all use slightly different versions with different bug fixes, so it wouldn't be entirely perfect
# [17:25] <rubys> perfect? No. But a dramatic improvement over today.
# [17:26] <rubys> And each vendor is going to have to invest some work effort towards html5 compliance. This should reduce the work for everybody.
# [17:33] <Philip`> Are vendors planning to replace their existing HTML parser with a shiny new HTML5 one, or are they planning to just receive lots of bug reports and make lots of small fixes until they pass most of the tests, or are they not planning anything yet?
# [17:39] <annevk> I think WebKit is planning on fixing bugs
# [17:39] <annevk> they're pretty close for most cases anyway
# [17:39] <annevk> dunno about other browsers
# [17:45] <Philip`> Hmm, the state transition graph gets a bit big when I split out all the different content models
# [17:48] * Quits: jgraham (n=jgraham@81-86-222-233.dsl.pipex.com) (Read error: 110 (Connection timed out))
# [17:54] * Joins: tndH_ (i=Rob@83.100.252.160)
# [17:54] * Quits: tndH (i=Rob@83.100.252.160) (Read error: 110 (Connection timed out))
# [17:54] * tndH_ is now known as tndH
# [18:00] * Joins: weinig (i=weinig@nat/apple/x-a6309fb9aa376651)
# [18:01] <Philip`> http://canvex.lazyilluminati.com/misc/states2.png
# [18:02] <annevk> ouch
# [18:02] <annevk> "HTML tokenizing. More trivial than it looks."
# [18:05] <Philip`> I think that's overestimating the possible transitions a little, since it assumes that whenever a tag token (either start or end) is emitted it could end up in any of the four content models
# [18:06] <Philip`> At least there's the nice DataState PLAINTEXT black hole at the bottom
# [18:06] <annevk> :)
# [18:17] <annevk> In the Live DOM Viewer in Internet Explorer the <!> sequence causes the DOM view to turn almost blank...
# [18:29] * Joins: tndH_ (i=Rob@83.100.252.160)
# [18:32] * Joins: h3h (n=w3rd@66-162-32-234.static.twtelecom.net)
# [18:33] * Quits: weinig (i=weinig@nat/apple/x-a6309fb9aa376651)
# [18:36] * Joins: weinig (i=weinig@nat/apple/x-204ff4e81de6ca4d)
# [18:36] * Quits: KevinMarks (n=KevinMar@c-76-102-254-252.hsd1.ca.comcast.net) ("The computer fell asleep")
# [18:37] * Joins: hasather (n=hasather@22.80-203-71.nextgentel.com)
# [18:38] <Philip`> It looks like my state transition thing agrees with the spec's comments about "This can only happen if the content model flag is set to the PCDATA state" etc, except for the bogus comment state where you have to do lots of slightly convoluted thinking to work out that it's correct
# [18:38] <Philip`> though, should the (non-bogus) comment states state that they can only happen when PCDATA, or is that obvious when left unstated?
# [18:47] * Quits: tndH (i=Rob@83.100.252.160) (Read error: 110 (Connection timed out))
# [18:47] <Philip`> (I suppose it should also be obvious that the only state you can be in with PLAINTEXT is the data state)
# [18:48] <annevk> I'm not sure why the other cases actually state it, to be honest
# [18:48] <annevk> It makes it just more confusing for the cases where it's not
# [18:50] * Quits: tndH_ (i=Rob@83.100.252.160) (Read error: 110 (Connection timed out))
# [18:51] * Quits: Lachy (n=Lachy@124-168-24-114.dyn.iinet.net.au) (Read error: 110 (Connection timed out))
# [18:54] <zcorpan_> annevk: it's because comments where the leading "!--" and trailing "--" don't fit, you can't read .nodeValue in ie
# [18:54] <zcorpan_> annevk: i solved that by using a try/catch in dom2string
# [18:55] <zcorpan_> annevk: and emitting "" if reading .nodeValue fails
# [18:55] <annevk> k
# [18:56] <Philip`> Ooh, neat, the W3C validator says <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"><title></title><table datapagesize=cheese><tr><td></table> is valid
# [18:56] <annevk> hehe
# [18:56] <zcorpan_> would be cool if the live dom viewer had an option to show the dom using dom2string_recursive
# [18:58] <zcorpan_> Hixie: yt?
# [18:59] * Joins: Ducki_ (n=Alex@dialin-212-144-055-172.pools.arcor-ip.net)
# [19:02] <annevk> zcorpan_, the real feature would be to make a mashup of http://james.html5.org/parsetree.html and your script
# [19:02] <annevk> zcorpan_, maybe just for the text input box
# [19:08] * Joins: aroben (n=adamrobe@17.203.15.248)
# [19:09] * Joins: Lachy (n=Lachy@203-158-59-119.dyn.iinet.net.au)
# [19:18] * Quits: Ducki (i=Alex@dialin-145-254-186-124.pools.arcor-ip.net) (Read error: 110 (Connection timed out))
# [19:29] * Joins: tndH (i=Rob@83.100.252.160)
# [19:30] * Quits: met_ (n=Hassman@r5bx220.net.upc.cz) ("Chemists never die, they just stop reacting.")
# [19:30] * Quits: BenWard (i=BenWard@nat/yahoo/x-86d43e2f7c62c229) ("Fades out again…")
# [19:47] * Joins: KevinMarks (i=KevinMar@nat/google/x-3d39f747c7a64a31)
# [19:51] * Joins: webben (i=benh@nat/yahoo/x-298224fddc481c77)
# [20:00] * Quits: hendry (n=hendry@kitten-x.com) (Read error: 113 (No route to host))
# [20:11] * Joins: hendry (n=hendry@kitten-x.com)
# [20:22] <Philip`> The tokeniser is much easier when I don't worry about actually implementing it, since I can just add a command like AppendHyphenToCommentToken and use it without caring about what it does
# [20:23] <Philip`> but I guess it'll all catch up with me when I do get around to the implementation bit :-(
# [20:27] <zcorpan_> Philip`: you're writing pseudo-code? :)
# [20:28] * Quits: tantek (n=tantek@adsl-63-195-114-133.dsl.snfc21.pacbell.net)
# [20:30] <Philip`> Yes :-)
# [20:31] <Philip`> (in a form that can be transformed into real code)
# [20:31] <Philip`> (but that just moves some of the work into the code that does the transformation)
# [20:32] <Philip`> (but it's a good excuse to learn OCaml anyway)
# [20:32] * Quits: Ducki_ (n=Alex@dialin-212-144-055-172.pools.arcor-ip.net) (Client Quit)
# [20:32] * Quits: MikeSmith (n=MikeSmit@eM60-254-202-189.pool.emobile.ad.jp) (Read error: 110 (Connection timed out))
# [20:33] * Joins: Ducki (n=Alex@dialin-212-144-055-172.pools.arcor-ip.net)
# [20:35] * Joins: MikeSmith (n=MikeSmit@eM60-254-197-94.pool.emobile.ad.jp)
# [20:37] * Joins: dbaron (n=dbaron@corp-242.mountainview.mozilla.com)
# [20:50] <Philip`> http://canvex.lazyilluminati.com/misc/states3.png - now with added doctype states, so I think it's got everything (and probably more bugs than before)
# [20:51] <Philip`> Oops, that's still got the EOF transitions...
# [20:52] <Philip`> Now it doesn't, so it's a bit prettier
# [20:57] <Philip`> Actually, I should probably tell it about parse errors too, so I can see if it's much simpler for conforming content
# [20:58] <zcorpan_> seems the algorithm in https://bugzilla.mozilla.org/attachment.cgi?id=188040 only has one flaw, which is before step 1: match the value against the list of color keywords
# [20:58] * Quits: weinig (i=weinig@nat/apple/x-204ff4e81de6ca4d) (Read error: 110 (Connection timed out))
# [21:00] * Joins: Ducki_ (i=Alex@dialin-145-254-186-098.pools.arcor-ip.net)
# [21:00] <annevk> zcorpan_, nice interop mess
# [21:01] * Quits: Ducki (n=Alex@dialin-212-144-055-172.pools.arcor-ip.net) (Read error: 104 (Connection reset by peer))
# [21:01] <zcorpan_> now i'll just see which keywords are supported, and if that differs from the keywords supported in css
# [21:01] * Quits: Codler (n=Codler@84-218-7-44.eurobelladsl.telenor.se) ("- nbs-irc 2.21 - www.nbs-irc.net -")
# [21:03] * Quits: Charl (n=charlvn@c1-228-9.wblv.isadsl.co.za) ("Leaving")
# [21:07] <Philip`> http://canvex.lazyilluminati.com/misc/states4.png - hmm, it does look much cleaner when you don't allow parse errors
# [21:11] <zcorpan_> wow. ie supports lightgrey but not lightgray. quite the opposite to all other gr(a|e)ys
# [21:12] <zcorpan_> Philip`: you can't get into the bogus states if you don't allow parse errors, right?
# [21:12] <Philip`> http://en.wikipedia.org/wiki/HTML_colors says lightgrey too
# [21:15] <zcorpan_> could there be other keywords supported that aren't listed in css3-color ?
# [21:15] <Philip`> zcorpan_: Yep - there's nothing leading into those states in the diagram, but I didn't bother stripping them out
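[Editorial sketch, not part of the original conversation.] The observation that the bogus states become unreachable once parse-error transitions are disallowed is a plain graph-reachability question. The transition list below is a hypothetical, heavily abbreviated fragment written for this example; it is not the data behind Philip`'s diagrams.

```python
from collections import deque

# (source state, target state, transition is a parse error)
# Hypothetical, abbreviated fragment of the tokeniser's state graph.
TRANSITIONS = [
    ("data", "tag open", False),
    ("tag open", "tag name", False),
    ("tag open", "markup declaration open", False),
    ("tag open", "bogus comment", True),
    ("markup declaration open", "comment", False),
    ("markup declaration open", "bogus comment", True),
    ("tag name", "data", False),
    ("comment", "data", False),
    ("bogus comment", "data", False),
]


def reachable(start, allow_parse_errors):
    """Breadth-first search over the transitions that are permitted."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for src, dst, is_error in TRANSITIONS:
            if src != state or dst in seen:
                continue
            if is_error and not allow_parse_errors:
                continue
            seen.add(dst)
            queue.append(dst)
    return seen
```

In this fragment, `reachable("data", True)` includes "bogus comment" while `reachable("data", False)` does not, which mirrors the point made above for conforming content.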
# [21:15] * Joins: jcgregorio (n=chatzill@209.79.152.140)
# [21:15] <zcorpan_> Philip`: ok
# [21:16] <Philip`> zcorpan_: I believe I looked in IE's .exe for colour names, and it didn't have any that weren't the standard set which CSS3 and every other browser includes
# [21:16] <zcorpan_> Philip`: ok. thanks
# [21:17] <Philip`> Oh, that was IE3
# [21:18] <Philip`> but I don't think they've changed it since then
# [21:18] <Philip`> since they just copied it from NN2
# [21:19] <Dashiva> Philip`: What if you colored the transition arrows depending on whether the transition requires a parse error or not?
# [21:19] <annevk> might be interesting to test DarkSeaGreen
# [21:19] <annevk> whether IE has the X11 or .Net impl
# [21:19] * Quits: jcgregorio (n=chatzill@209.79.152.140) (Client Quit)
# [21:19] * annevk got that from the wikipedia page
# [21:20] <zcorpan_> annevk: darkseagreen is in css3-color
# [21:20] <Philip`> Dashiva: That sounds worth doing
# [21:20] <zcorpan_> ah
# [21:20] <Philip`> though what about transitions that can be both parse errors and not?
# [21:21] <Dashiva> a third color, or both?
# [21:22] <Philip`> Hmm, I'll just draw two arrows, because then I won't have to change my code :-)
# [21:22] <annevk> some more arrows wouldn't hurt
# [21:22] <annevk> it's not always clear what the direction is :)
# [21:23] <Dashiva> Maybe put an arrowhead on the middle of the arrow too
# [21:24] <Philip`> Hmph, colour PNGs are huge
# [21:25] <zcorpan_> annevk: ie uses x11
# [21:25] <Philip`> http://canvex.lazyilluminati.com/misc/states5.png
# [21:25] * Joins: weinig (i=weinig@nat/apple/x-6e3b9ac0c16bd8f1)
# [21:28] <Philip`> Hmm, I don't think I can make Graphviz draw arrow heads except at the end
# [21:30] <annevk> zcorpan_, so how do you test which color is used? some color picker?
# [21:31] <zcorpan_> annevk: .bgcolor returns the rgb color
# [21:32] <zcorpan_> er, .bgColor
# [21:32] <annevk> cool, automated testing
# [21:34] <zcorpan_> http://simon.html5.org/test/html/parsing/color-attributes/keywords/
# [21:36] <zcorpan_> i haven't sent anything to the list about color attributes yet, have i
# [21:38] <annevk> prolly not: http://www.google.com/search?q=inurl:whatwg-whatwg+color
# [21:39] * Philip` wonders if he could automatically generate tests to cover all the possible state transitions
# [21:39] <annevk> in http://simon.html5.org/test/html/parsing/color-attributes/ you can change Opera to none too
# [21:39] <annevk> Philip`, that'd be most useful
# [21:40] <annevk> Philip`, format: http://html5lib.googlecode.com/svn/trunk/testdata/tokenizer/ pretty please :)
# [21:40] <zcorpan_> annevk: ah. cool.
# [21:41] <annevk> Philip`, or maybe in the tree construction format...
# [21:41] <annevk> Philip`, that would prolly be useful too especially for testing browsers
# [21:42] <Philip`> The tree construction format probably wouldn't work too well when I don't have a tree constructor, unless I'm missing some point...
# [21:43] <annevk> ah, if you want to debug your own code, then no
# [21:44] <Philip`> Ah, okay - I think it would be nice to have something I could use for just tokeniser tests
# [21:44] <annevk> then use the funky json format :)
# [21:45] <annevk> I wonder if that can be used in some meaningful way on browsers too... prolly not
# [21:45] * Joins: webben_ (i=benh@nat/yahoo/x-33bf928752899e80)
# [21:45] <Philip`> though I don't know how to cope with the issue that the tree construction stage can affect the tokeniser's content model, when there's no tree construction stage
# [21:45] * Quits: webben (i=benh@nat/yahoo/x-298224fddc481c77) (Read error: 104 (Connection reset by peer))
# [21:45] <annevk> see escapeFlag.test and contentModelFlags.test
# [21:45] <Philip`> Incidentally, "content model flag" is a confusing name since most flags don't have four states...
# [21:46] <Philip`> Oh, right - that looks useful :-)
# [21:48] * Quits: webben_ (i=benh@nat/yahoo/x-33bf928752899e80) (Client Quit)
# [21:49] <Philip`> Shouldn't the test format include attributes on end tags, since the tokeniser is meant to emit them?
# [21:50] * Joins: bzed (n=bzed@dslb-084-059-100-221.pools.arcor-ip.net)
# [21:50] <annevk> the tokeniser doesn't emit them
# [21:51] <annevk> Hixie, those stats on AAA are useful! thanks
# [21:51] <Philip`> "Start and end tag tokens have a tag name and a list of attributes, each of which has a name and a value." "When an end tag token is emitted with attributes, that is a parse error." - it sounds like they are emitted
# [21:52] <annevk> oh, ok
# [21:52] <Hixie> annevk: which ones?
# [21:52] <annevk> Hixie, the ones you pasted in IRC earlier; how many times duplication is hit etc.
# [21:53] <annevk> although I'd love to see more detail :)
# [21:53] <Hixie> ah yes
# [21:53] <Hixie> i'll be posting more in due course
# [21:54] * Joins: jgraham (n=jgraham@81-86-213-61.dsl.pipex.com)
# [22:12] * Quits: othermaciej (n=mjs@dsl081-048-145.sfo1.dsl.speakeasy.net) (Read error: 104 (Connection reset by peer))
# [22:15] * Quits: maikmerten (n=maikmert@T72ea.t.pppool.de) ("Leaving")
# [22:15] * Joins: othermaciej (n=mjs@dsl081-048-145.sfo1.dsl.speakeasy.net)
# [22:20] <annevk> jgraham, I've been thinking about removing all the classes in html5parser.py
# [22:20] <annevk> having said that, it hasn't been more than thinking
# [22:21] <annevk> I'm not sure if we would actually gain anything from removing them and moving to a bunch of if/else statements as opposed to dictionary based method invocations
# [22:21] <annevk> what we have now might actually be faster
# [22:21] <rubys> why remove them then?
# [22:21] <jgraham> annevk: I would imagine what we have now is faster
# [22:22] <jgraham> (although I would need metrics to be sure, of course)
# [22:22] <jgraham> I think the time would be better spent on Chtml5lib
# [22:22] <annevk> prolly
# [22:23] <rubys> If I did the port, who would contribute to it?
# [22:23] <jgraham> rubys: I guess it would be one way for me to finally learn C :)
# [22:24] <rubys> I took a look at it, and porting it to C++ would probably take about a week. To C would be another week.
# [22:24] <annevk> if I learn how to work with C on Ubuntu (besides learning to work with C in general) I would probably contribute
# [22:24] <jgraham> (which is a way of saying I would love to contribute fixes but I don't feel confident in designing it)
# [22:24] <annevk> not sure how much time I would invest on the python version afterwards
# [22:24] <rubys> I would simply port the existing design. After it is working, it could be optimized, refactored, etc.
# [22:25] <bewest> in that case why not profile the python version and move slow parts to C?
# [22:25] <annevk> hmm, how are we going to handle <noscript>?
# [22:26] <annevk> bewest, how is that better?
# [22:26] <jgraham> That sounds great to me; I simply don't have enough C experience to know how best to implement things that are currently e.g. lists in python in C
# [22:26] <annevk> we can prolly steal some ideas from Hixie's and hsivonen's impl
# [22:26] <bewest> annevk: maybe it's not :/
# [22:26] <rubys> C++ has a standard library. Going to C next would mean reimplementing those concepts.
# [22:26] <jgraham> bewest: It's not like there's one slow bit, it's the overhead of doing things many times
# [22:26] <bewest> yeah
# [22:26] <jgraham> e.g. many function calls
# [22:27] <Philip`> I'd be interested to see if my C++ tokeniser implementation could actually work in practice
# [22:27] <jgraham> Philip`: the O'Caml one?
# [22:27] <annevk> rubys, if we're going to do it C might be better if we get more detailed control over things like the inputstream
# [22:27] <Philip`> jgraham: Yes
# [22:27] <annevk> Question: scripting is enabled or disabled?
# [22:27] <annevk> we don't have any tests for <noscript> atm...
# [22:28] <Philip`> (The C++-generating part is totally broken now, but http://canvex.lazyilluminati.com/misc/states5.png is generated from exactly the same data as the C++ tokeniser would be)
# [22:31] <annevk> I'll assume that scripting is enabled for now
# [22:31] <annevk> I suppose at some point we can provide a switch and enable/disable tests conditionally
# [22:33] <Philip`> Could the test format be made to handle scripts modifying the input stream?
# [22:34] * Joins: wild_cfo (n=wild_c_f@ool-44c1bb48.dyn.optonline.net)
# [22:34] * Quits: wild_cfo (n=wild_c_f@ool-44c1bb48.dyn.optonline.net) (Client Quit)
# [22:35] <Philip`> You couldn't really expect parsers to all have script interpreters, but you could define that the tests can have <script>document.write("<p>")</script> (for some arbitrary JSON-encoded string) and the test harness can push those strings back into the input stream, to make sure the parser copes properly
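[Editorial sketch, not part of the original conversation.] The harness idea described above, recognising a literal `document.write` of a JSON-encoded string and pushing that string back into the input, might look roughly like this. `PushbackStream` and `feed` are hypothetical names, and the pattern matching stands in for a real tokeniser, so this only shows the push-back mechanics.

```python
import json
import re

# Matches a script whose whole body is document.write(<JSON string>),
# anchored at the end of what has been consumed so far.
_WRITE_AT_END = re.compile(
    r'<script>document\.write\(("(?:[^"\\]|\\.)*")\)</script>\Z')


class PushbackStream:
    """Character stream that lets the harness insert text at the read position."""

    def __init__(self, data):
        self._chars = list(data)
        self._pos = 0

    def read(self):
        if self._pos >= len(self._chars):
            return None  # end of input
        ch = self._chars[self._pos]
        self._pos += 1
        return ch

    def insert(self, text):
        # The inserted characters are the next ones read() will return.
        self._chars[self._pos:self._pos] = list(text)


def feed(source):
    """Return the character sequence a parser would see, writes included."""
    stream = PushbackStream(source)
    consumed = []
    while True:
        ch = stream.read()
        if ch is None:
            return "".join(consumed)
        consumed.append(ch)
        if ch == ">":
            match = _WRITE_AT_END.search("".join(consumed))
            if match:
                stream.insert(json.loads(match.group(1)))
```

With this sketch, feeding `a<script>document.write("<p>")</script>b` yields the same text with `<p>` inserted immediately after the closing `</script>`, so the parser under test has to cope with input that arrives mid-parse.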
# [22:40] <annevk> at least for tree construction that's feasible
# [22:40] <annevk> I was thinking of maybe offering #document-scripting-disabled at some point which provides an alternate tree and prolly also #errors-scripting-disabled
# [22:41] <Hixie> just so everyone is aware and doesn't wonder if i died or something, i'm going to be on vacation for 3 weeks starting sunday
# [22:41] <gsnedders> I'll make sure to ask if you've died.
# [22:42] <hasather> Hixie: have fun :)
# [22:42] <Hixie> i'll try! :-)
# [22:42] <gsnedders> more seriously, where are you going?
# [22:42] <Hixie> europe, east coast, various places around there
# [22:43] <Hixie> apparently spending a lot of time in layovers at Schiphol
# [22:43] <Hixie> which doesn't bode well for my luggage
# [22:43] <annevk> yeah, it does that to you
# [22:44] <gsnedders> I'm probably not getting out of the UK this summer
# [22:44] <jgraham> gsnedders: Me neither (although I have been to various conferences abroad)
# [22:45] <gsnedders> I'm going off down to Cambridge, but that's it. Probably going to Paris with my sister + her husband over the October holidays, though
# [22:46] <jgraham> I assure you that Cambridge is lovely in every way. As long as you don't like hills. Or even slight rises.
# [22:46] <jgraham> And, preferably, have a thing for tourists and punt touts
# [22:46] <gsnedders> my grandmother lives in Cambridge, I've been plenty of times. Doesn't seem that hilly to someone from Scotland, though.
# [22:47] <gsnedders> I should try actually punting again…
# [22:47] <jgraham> It's really not that hilly. That's why you can't like hills if you want to like Cambridge
# [22:47] * jgraham wants to move away just to get some hills
# [22:47] <gsnedders> jgraham: come here!
# [22:48] <gsnedders> [Fife]
# [22:49] <jgraham> Fife would be nice. How are the employment opportunities though?...
# [22:50] * Joins: hober (n=ted@unaffiliated/hober)
# [22:50] <gsnedders> No idea. I'm too young to know such things :)
# [22:50] <jgraham> And I, sadly, am almost old enough to have to care :(
# [22:51] * gsnedders goes back to showing how young he is by looking up university entrance requirements
# [22:51] <Dashiva> I feel old now
# [22:52] * Quits: ROBOd (n=robod@86.34.246.154) ("http://www.robodesign.ro")
# [22:58] <Philip`> You have to put up with all the students in Cambridge too :-p
# [22:58] * Quits: othermaciej (n=mjs@dsl081-048-145.sfo1.dsl.speakeasy.net)
# [22:59] <gsnedders> hmmm… AAAAB at the min. for Higher entrance into Oxford
# [22:59] * gsnedders marks English as the B
# [22:59] <Philip`> though I suppose they're usually outnumbered by tourists
# [22:59] <gsnedders> Philip`: the terms aren't overly long at Cam/Oxf
# [23:00] * Quits: Ducki_ (i=Alex@dialin-145-254-186-098.pools.arcor-ip.net) (Read error: 113 (No route to host))
# [23:00] <Philip`> 3 * 8 weeks, with three months off for the summer vacation :-)
# [23:00] * Joins: Ducki_ (n=Alex@dialin-145-254-186-098.pools.arcor-ip.net)
# [23:00] <gsnedders> Philip`: which gives plenty of time for tourists to rule supreme :)
# [23:00] <gsnedders> (I couldn't myself remember whether it was 8v10 or 10v12)
# [23:01] <Philip`> It's nice during the exam term when they stop all the tourists coming into the colleges
# [23:02] <gsnedders> I don't think I've ever been there at the time, due to school
# [23:02] <Philip`> (Er, but I have no idea how many colleges do that)
# [23:02] <gsnedders> (and nowadays I have exams at the same time)
# [23:02] <gsnedders> Philip`: all do, IIRC
# [23:05] <jgraham> Philip`: the quantity tourists+students is roughly conserved over the whole year
# [23:06] <Hixie> cute, this http://triin.net/2006/06/12/Coding_practices_of_web_pages page refers to my 2005-12 study
# [23:07] <Hixie> wow, the numbers he gets are very similar to the numbers i got in that study
# [23:07] <Hixie> ncie
# [23:07] <Hixie> nice
# [23:07] <Hixie> (comparing http://code.google.com/webstats/2005-12/pages.html to http://triin.net/2006/06/12/HTML)
# [23:08] <Hixie> even the oddities are present in both studies
# [23:08] <Hixie> that's awesome
# [23:09] * Joins: csarven (n=nevrasc@modemcable081.152-201-24.mc.videotron.ca)
# [23:18] * Quits: annevk (n=annevk@pat-tdc.opera.com) (Read error: 110 (Connection timed out))
# [23:18] * Joins: webben (n=benh@91.84.193.157)
# [23:19] <hsivonen> MikeSmith: my Java impl has configurable XML 1.0 compat
# [23:21] <hsivonen> MikeSmith: for various features you can choose to be conforming to HTML5 (and potentially violate XML 1.0), not to violate XML 1.0 by treating violations as fatal errors or not violate XML 1.0 by being non-conforming to HTML 5 and making infoset-altering coercions
# [23:22] * Quits: weinig (i=weinig@nat/apple/x-6e3b9ac0c16bd8f1) (Read error: 110 (Connection timed out))
# [23:23] <hsivonen> rubys: it might be a good idea to do an independent implementation in C. I believe Mike Day has already started one. I chose to do an independent implementation in Java using only test cases from html5lib in order to make a library that makes the most of Java instead of trying to map Pythonic stuff to Java
# [23:24] * Quits: Ducki_ (n=Alex@dialin-145-254-186-098.pools.arcor-ip.net) (Read error: 110 (Connection timed out))
# [23:31] <MikeSmith> hsivonen - thanks for the info
# [23:37] <hsivonen> MikeSmith: to elaborate a bit: the SAX interface makes it possible for me to violate the interface contract in a way that exposes all of HTML5 in a way that may violate XML 1.0. The XOM interface, by design, won't allow it. When using a DOM impl meant for XML, some of the violation may not pass, either.
# [23:38] <hsivonen> MikeSmith: so the non-XML stuff will be available through SAX (which I'm treating as the native interface) and custom DOM impls if someone cares to make one
# [23:41] * Joins: weinig (i=weinig@nat/apple/x-980a2e775f61ddd9)
# [23:42] <rubys> hsivonen: the Ruby implementation is meant to make the most of Ruby, and diverges in a number of significant ways.
# [23:42] <rubys> I did use the Python implementation as a starting point, but only as that, and only because it saved me some time.
# [23:43] <hsivonen> rubys: ok. anyway, I suggest pinging Mike Day to avoid duplicating what he has already been doing
# [23:44] <rubys> that's why I've been advocating putting implementations into one place (html5lib)... so as to minimize the "search time" it takes to find out the actual current state of an implementation.
# [23:45] <rubys> what is the license, for example, of Mike's work?
# [23:46] <hsivonen> rubys: the reason why I put the Java impl in a different repo is to keep it together with the rest of the conformance checker which in turn is there in order to keep it together with the schema project
# [23:46] <hsivonen> rubys: MIT/expat, IIRC
# [23:47] <hsivonen> rubys: MIT/expat seems to be the convention for HTML5 parsers :-)
# [23:47] <rubys> ... eventually it will likely no longer be "the" (as in "the only") Java implementation. :-)
# [23:48] <hsivonen> rubys: do you mean because of the repo choice or in general?
# [23:50] <rubys> the two parsers that are in html5 have essentially zero required dependencies, and very few optional dependencies. I'd like to see a similar effort in PHP, Java, C#, and C.
# [23:51] <hsivonen> rubys: my Java impl depends on a couple of my utility classes and ICU4J
# [23:51] <hsivonen> rubys: putting the utility classes in one jar with the parser is not a big deal
# [23:51] <rubys> I tried downloading it once. That was not the impression I got. But perhaps I was wrong.
# [23:52] <hsivonen> rubys: making ICU4J optional for reduced correctness is not a big deal, either
# [23:52] <hsivonen> rubys: do you mean you downloaded the parser that I'm currently working on, or the conformance checker way back when you mentioned it in your blog comments?
# [23:53] <rubys> way back when
# [23:53] <hsivonen> rubys: when my parser implementation is in a state where it can actually be used, I intend to offer a binary jar that doesn't require you to run the whole conformance checker build
# [23:53] <hsivonen> (and the conformance checker build is now much easier, too)
# [23:54] <hsivonen> rubys: the parser I'm now writing is not the prototype parser you saw way back when
# [23:54] <rubys> Cool. Is there a single place where implementations can be found?
# [23:55] * Joins: othermaciej (n=mjs@17.255.106.198)
# [23:55] <rubys> If not, can we make such a list on http://wiki.whatwg.org/wiki/ ?
# [23:55] <hsivonen> rubys: dunno if the WHATWG wiki is up to date
# [23:55] <hsivonen> rubys: in any case, I suggest that we link to each other whenever someone makes something runnable in a new language
# [23:56] <hsivonen> (my tree builder is not runnable just yet)
# [23:56] <rubys> How about this: I'll update html5lib to point to http://wiki.whatwg.org/wiki/Implementations
# [23:56] <hsivonen> makes sense
# [23:56] <hsivonen> svn co http://svn.versiondude.net/whattf/htmlparser/trunk/ htmlparser
# [23:56] <hsivonen> in case you are interested
# [23:57] <hsivonen> depends on the util module in the same repo, ICU4J and Java5
# [23:57] * Quits: weinig (i=weinig@nat/apple/x-980a2e775f61ddd9)
# [23:58] * Quits: MikeSmith (n=MikeSmit@eM60-254-197-94.pool.emobile.ad.jp) ("Less talk, more pimp walk.")
# [23:58] * Parts: hasather (n=hasather@22.80-203-71.nextgentel.com)
# [23:58] <rubys> are there any tests?
# [23:59] <hsivonen> rubys: you need to check out html5lib separately to get test data
# [23:59] <hsivonen> rubys: there are test harnesses for running html5lib encoding tests and tokenization tests
# [23:59] <hsivonen> (tree builder harness will follow in due course)
# [23:59] * Joins: MikeSmith (n=MikeSmit@eM60-254-197-94.pool.emobile.ad.jp)
# Session Close: Fri Jul 06 00:00:00 2007
