- # Session Start: Wed Jul 11 00:00:00 2007
- # Session Ident: #html-wg
- # [00:00] <zcorpan> would be cool
- # [00:00] <Philip`> It'd probably be pretty much identical to the C++ version, except for replacing 'bool' with 'var' and rewriting all the support code around the edges
- # [00:00] <zcorpan> although tree construction with the dom core apis won't work in some cases :|
- # [00:03] <hsivonen> zcorpan: other than doctype?
- # [00:04] <zcorpan> yeah. attributes that start with =. element names that contain &. etc
- # [00:05] <zcorpan> raise exceptions if you try to create them
- # [00:13] * Quits: tH (Rob@87.102.67.108) (Quit: ChatZilla 0.9.78.1-rdmsoft [XULRunner 1.8.0.9/2006120508])
- # [00:13] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
- # [00:14] <hsivonen> http://software.hixie.ch/utilities/js/live-dom-viewer/?%3C%21DOCTYPE%20html%3E%0A%3Ctable%3E%0A%3Ctr%3E%3Ctd%3ECell%3C/td%3E%3C/tr%3E%0Afoo%3C%21--%20--%3E%20%3C%21--%20--%3Ebar%0A%3Ctr%3E%3Ctd%3ECell%3C/td%3E%3C/tr%3E%0A%3C/table%3E%0A%3Ctable%3E%0A%3Ctr%3E%3Ctd%3ECell%3C/td%3E%3C/tr%3E%0A%3C%21--%20--%3E%3C%21--%20--%3E%0A%3Ctr%3E%3Ctd%3ECell%3C/td%3E%3C/tr%3E%0A%3C/table%3E%0A
- # [00:14] <hsivonen> Check out Gecko.
- # [00:15] <hsivonen> Safari is closer to spec.
- # [00:16] <hsivonen> Safari seems to do what I proposed on list
- # [00:18] <zcorpan> iirc there was another proposal to use a flag. when you hit something that causes foster reparenting the flag is set to true, and then whitespace and comments are also fosterparented. the flag is set to false when you hit a table-related element again
- # [00:18] <zcorpan> or some such
- # [00:18] <zcorpan> i.e. what firefox does afaict
- # [00:18] * Joins: gavin (gavin@74.103.208.221)
- # [00:19] <zcorpan> does safari drop text nodes with only whitespace between comments?
- # [00:20] <zcorpan> only in table
- # [00:20] * Quits: hyatt (hyatt@17.203.14.191) (Quit: hyatt)
- # [00:20] <hsivonen> zcorpan: I was testing WebKit trunk, actually
- # [00:21] <hsivonen> zcorpan: release Safari doesn't put comments in the DOM
- # [00:22] <zcorpan> hsivonen: 3 beta for windows does
- # [00:22] <zcorpan> hsivonen: but i see that trunk doesn't drop the text node
- # [00:25] * Joins: hyatt (hyatt@17.203.14.191)
- # [00:27] <Philip`> Hooray, JS tokeniser works
- # [00:27] <Philip`> ...for a rather limited set of inputs
- # [00:29] <Philip`> http://canvex.lazyilluminati.com/misc/parser/tokeniser_js.html
- # [00:29] <Philip`> Probably only works in Firefox because I used uneval since I didn't want to actually put any effort into it
- # [00:29] <Philip`> but at least it handles tags and attributes alright
- # [00:30] <zcorpan> Philip`: nice!
- # [00:32] <Philip`> If anyone wants to make it work decently, please feel free :-)
- # [00:34] * gsnedders is tempted to ask what the point of that is
- # [00:36] <Philip`> It could (if it had the rest of the parser) be like http://james.html5.org/parsetree.html except without needing any server-side code
- # [00:36] <Philip`> I'm still not sure what the point of that would be, though
- # [00:37] <Philip`> But HTML can't be considered a complete platform for application development until you can write a whole web browser in it, so an HTML parser is highly useful for that
- # [00:37] <gsnedders> If I wanted an HTML parser in an HTML document, why not just use the browser's own parser
- # [00:37] <zcorpan> comparing the browser's tree with the spec
- # [00:37] <zcorpan> to find bugs in the spec
- # [00:37] <Philip`> It could provide the solution to backward-compatibility problems!
- # [00:38] <Philip`> Instead of <!doctype html>, just get people to use <script src=http://w3.org/2009/html5></script> as the magic line at the top of their file
- # [00:39] <Philip`> HTML5 UAs can detect that and remove it, while all others will execute the script
- # [00:39] <zcorpan> -_-
- # [00:39] <gsnedders> 2009? feeling optimistic? :P
- # [00:39] <Philip`> which can read in the rest of the document content, then use the JS HTML parser to construct the DOM
- # [00:40] <jgraham> Thus creating perfect interoperability and *really* slow sites
- # [00:40] <Philip`> It's a flawless plan
- # [00:40] <jgraham> :)
- # [00:40] <Philip`> That'll just encourage users to upgrade their browsers
- # [00:40] <jgraham> Good point
- # [00:40] <zcorpan> or leave your site
- # [00:41] <jgraham> Does Safari really pick up a charset attribute on the html element? AFAICT Opera and Firefox don't
- # [00:42] <gsnedders> doesn't appear to
- # [00:42] <gsnedders> (saf 3 beta os x)
- # [00:43] <zcorpan> doesn't per my testing either
- # [00:43] * jgraham hypothesises that Robert Burns's text editor added a BOM or something
- # [00:43] <gsnedders> anyhow, g'nite (4realz)
- # [00:44] <jgraham> goodnight
- # [00:44] <zcorpan> nn
- # [00:44] <gsnedders> (yes, I had to throw a "4realz" in)
- # [00:44] <gsnedders> back to spec reviewing tomorrow (yay! :\)
- # [00:44] * zcorpan too
- # [00:45] * gsnedders waits to be asked next term, "What did you do over the summer holidays?"
- # [00:45] <gsnedders> Why, review HTML 5, of course!
- # [00:45] <jgraham> I thought you were going to bed! ;)
- # [00:45] * zcorpan too
- # [00:45] * jgraham should sleep soon as well
- # [00:46] <gsnedders> jgraham: 4realz. :D
- # [00:48] * Parts: hasather (hasather@80.203.71.22)
- # [00:50] * Quits: heycam (cam@203.214.115.243) (Ping timeout)
- # [00:52] * zcorpan added http://simon.html5.org/test/html/parsing/encoding/002.htm
- # [00:53] <Philip`> I can't work out how to make my browsers stop treating every document as UTF-8, regardless of what meta-charset or actual characters they have in them...
- # [00:54] <zcorpan> Philip`: opera?
- # [00:55] * zcorpan finds it interesting that Robert says he doesn't know whether or not research has been made when he has been told that research has been made, been pointed to the relevant test cases, and to the relevant part of the spec
- # [00:56] <Philip`> Opera/FF/IE/Safari
- # [00:56] <Philip`> Maybe I configured my web server to do something...
- # [00:57] <Philip`> Oh, yes, that would explain it
- # [00:57] <zcorpan> AddDefaultCharset utf-8 ? :)
- # [00:59] <Philip`> Yes :-(
- # [00:59] <Philip`> (Well, actually, AddCharset UTF-8 .html)
- # [01:00] <zcorpan> ok
- # [01:00] * Philip` renames his file to .htm
- # [01:01] * Joins: MikeSmith (MikeSmith@mcclure.w3.org)
- # [01:02] <zcorpan> nn
- # [01:02] * Quits: hyatt (hyatt@17.203.14.191) (Quit: hyatt)
- # [01:02] * Parts: zcorpan (zcorpan@84.216.41.183)
- # [01:05] <Philip`> When I'm not doing anything stupid, I also agree that Safari 3 on Windows (and FF2 and IE7) does ignore <html charset> but respect <meta charset>
- # [01:08] * Parts: billmason (billmason@69.30.57.156)
- # [01:34] * Joins: sbuluf (xtyh@200.49.140.181)
- # [01:36] * Joins: heycam (cam@130.194.72.84)
- # [02:10] * Quits: MikeSmith (MikeSmith@mcclure.w3.org) (Ping timeout)
- # [02:16] * Joins: karl (karlcow@128.30.52.30)
- # [02:20] * Joins: Lionheart (robin@66.57.69.65)
- # [02:20] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
- # [02:24] * Quits: Zeros (Zeros-Elip@67.154.87.254) (Quit: Leaving)
- # [02:25] * Joins: gavin (gavin@74.103.208.221)
- # [02:36] * Joins: MikeSmith (MikeSmith@mcclure.w3.org)
- # [02:42] <karl> http://en.wikipedia.org/wiki/Usage_share_of_web_browsers
- # [02:45] <MikeSmith> karl - I look forward to seeing how these publishers of browser market-share data handle browsers running on devices other than PCs
- # [02:46] <karl> yep me too
- # [02:46] <MikeSmith> we will see the day when the number of people browsing from desktop PCs is eclipsed by those browsing from other devices
- # [02:46] <karl> I wonder if Safari iPhone has a different user agent for example
- # [02:46] <MikeSmith> I would think it does
- # [02:47] <karl> MikeSmith: it is the case somewhere… I have read something about this recently
- # [02:47] <mjs> versions are different, basic stuff is the same
- # [02:47] <karl> damn I don't remember where
- # [02:47] <karl> good evening, mjs
- # [02:47] <MikeSmith> mjs - I know I owe you a follow-up on the accesskey thread
- # [02:48] <MikeSmith> and cheers for you guys adding node-set() support to Webkit
- # [02:48] <karl> http://economictimes.indiatimes.com/Indians_prefer_to_surf_Net_on_the_go/articleshow/2183516.cms
- # [02:48] <karl> Indians prefer to surf Net on the go
- # [02:49] <MikeSmith> despite whatever others may think negatively about client-side XSLT, a lot of developers like it -- having another option
- # [02:49] <karl> "The number of Indians accessing internet through their mobile phones is now over three times those using the PC to connect to the Web. India has 9.27 million internet subscribers as against 31.30 million users who access internet through their mobile handsets—GSM or CDMA—to read and reply to mails, download content and for online transactions, according to latest figures released by telecom regulator Trai. "
- # [02:49] <MikeSmith> karl - haven't read that article but I would guess that's because many don't have Net access at home
- # [02:49] <MikeSmith> but instead many go to Net cafe and such
- # [02:50] <karl> yes it is hard to know real stats on that.
- # [02:50] <karl> There is an infrastructure issue too.
- # [02:50] <mjs> MikeSmith: I think the whole discussion needs to be restarted with a clear statement of what problems it's trying to solve, and how it will avoid the pitfalls of accesskey so far
- # [02:50] <mjs> MikeSmith: I think the desktop implementations of it so far are laughably bad and the lack of use in desktop content reflects that
- # [02:51] <karl> Mobile has developed a lot in Africa, because it is easier to distribute than having to rely on cables. More flexible.
- # [02:51] <MikeSmith> mjs - yeah, I pretty much agree with that. I don't personally see such a compelling use case for accesskey on desktop
- # [02:51] <mjs> (it seems clear to me that if tapping control and then hitting F does something totally different than hitting either F or control-F, that's going to be a usability problem)
- # [02:52] <MikeSmith> I think this is a good example of importance of considering carefully the consequences of anything new we spec
- # [02:52] <MikeSmith> because in the end, after it gets deployed, we have to live with it
- # [02:53] <MikeSmith> I would vote for trying to spec accesskey based on how it is most often currently used in the wild
- # [02:54] <MikeSmith> not on what anybody hopes or wants it to be used for
- # [02:59] <karl> interesting benchmarks - http://krijnhoetmer.nl/irc-logs/html-wg/20070710#l-239
- # [03:00] * MikeSmith is trying to remember Opera's desktop numbers and wonders whether users of Wii browsers will eventually account for quite a large percentage of the Opera user base worldwide
- # [03:00] <Philip`> I would test the Ruby one too if I even vaguely knew how to write Ruby :-)
- # [03:02] <MikeSmith> Ruby is supposed to be so easy and wonderful to program in that you don't really need to know how to write it. it just happens, like magic
- # [03:05] <MikeSmith> I do like writing in Ruby, actually. When I was at a previous employer I used Ruby to write a prototype for a Web app for doing some xml-rpc interaction with an engine that indexed user e-mail message stores
- # [03:05] <karl> MikeSmith: certainly in the mines of King Solomon
- # [03:05] <Philip`> I've already written C++, Java, JavaScript, Python, Perl and OCaml today, and now it's 2am, so I think my brain will explode if I look at yet another language using the same symbols in different ways :-(
- # [03:05] <MikeSmith> heh
- # [03:06] <Philip`> and yet another way to find the length of a list
- # [03:06] * Philip` wonders why no two languages ever seem to do that in the same way
- # [03:07] <MikeSmith> Philip` - I've heard you mention OCaml before but don't know much about it ... what's it good for?
- # [03:07] <Philip`> It's good for making one's brain explode
- # [03:07] <Philip`> or at least it takes a bit of getting used to
- # [03:08] <Philip`> (but I had to learn SML at university a while ago, and OCaml uses mainly the same concepts)
- # [03:10] <Philip`> It seems to be quite useful for manipulating complex data structures - I have something like http://canvex.lazyilluminati.com/svn/tokeniser/cpp.ml to create trees of C++ code and then print them prettily, and the functions just do pattern-matching to respond to the appropriate types
- # [03:12] <MikeSmith> Philip` - I se
- # [03:12] <MikeSmith> see
- # [03:13] <Philip`> It's also a (mostly) functional language, so I can have an implementation of the tokeniser that is side-effect-free so I can easily tell that it's not messing with things it shouldn't be messing with
- # [03:14] <Philip`> and the tokeniser is just based around a function which takes a state value and returns the next state value, which makes it easy to e.g. fork the character stream and see how it responds differently to different inputs
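The state-in, state-out design described above can be sketched in Python. This is a minimal toy, not Philip`'s OCaml code and not the HTML5 tokeniser: the `State` type, the two modes, and the token names are assumptions made up for illustration. Because `step` returns a new immutable state instead of mutating anything, the same intermediate state can be fed two different continuations.

```python
from typing import NamedTuple, Tuple


class State(NamedTuple):
    """Immutable tokeniser state (hypothetical, for illustration only)."""
    mode: str                   # "data" or "tag"
    buffer: str                 # characters collected for the current token
    tokens: Tuple[tuple, ...]   # tokens emitted so far


def step(state: State, ch: str) -> State:
    """Consume one character and return the next state; the input is untouched."""
    if state.mode == "data":
        if ch == "<":
            # Flush any pending text as a Character token, then enter tag mode.
            flushed = (("Character", state.buffer),) if state.buffer else ()
            return State("tag", "", state.tokens + flushed)
        return state._replace(buffer=state.buffer + ch)
    # mode == "tag"
    if ch == ">":
        return State("data", "", state.tokens + (("StartTag", state.buffer),))
    return state._replace(buffer=state.buffer + ch)


def run(state: State, text: str) -> State:
    """Fold step() over a string of characters."""
    for ch in text:
        state = step(state, ch)
    return state


start = State("data", "", ())
shared = run(start, "a<b")     # common prefix, tokenised once
fork_1 = run(shared, ">")      # one possible continuation
fork_2 = run(shared, "r>")     # a different continuation from the same state
```

After both forks, `shared` is unchanged, which is the property that makes it easy to compare how the tokeniser responds to different inputs.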
- # [03:15] <MikeSmith> side-effect free? hey, maybe you can write a tokenizer in XSLT 1.0 - that'd be fun, given how painful even most simple string processing is in XSLT ... side-effect freeness was one of the design goals that James Clark had for XSLT
- # [03:15] <MikeSmith> OCaml has a good regular expressions library?
- # [03:16] <Philip`> I think it does have a regular expression library but I haven't used it at all
- # [03:17] * Joins: olivier (ot@128.30.52.30)
- # [03:18] <MikeSmith> I have some professional acquaintances who are doing a lot of work using Erlang
- # [03:19] <Philip`> Mainly I'm using OCaml because it's interesting to try, and I'll probably have to end up writing stuff in it for the next three years or so, so I might as well learnt it now :-)
- # [03:19] <Philip`> *learn
- # [03:21] <MikeSmith> makes sense
- # [03:22] * MikeSmith wonders what draft Rob Burns is talking about when he says, in reply to jgraham, "I wonder whether anyone reads the draft"
- # [03:23] <MikeSmith> Rob Burns wrote a draft of something? if he did, I gotta admit I have not read it
- # [03:24] <Philip`> I assumed he meant the HTML5 draft
- # [03:24] <Philip`> I think jgraham probably has read that, though
- # [03:46] * Joins: Zeros (Zeros-Elip@67.154.87.254)
- # [04:08] <MikeSmith> hsivonen - I'm wondering if it would be too early to try to gather some people together for specific discussion about building conformance checkers
- # [04:10] <MikeSmith> It's worrying to think about how much work it will be to build conformance checkers other than the one you have built, in other languages
- # [04:10] <MikeSmith> and how we can try to ensure that they report the same results
- # [04:11] <MikeSmith> because we have some grief waiting if we create a situation where multiple HTML5 conformance checkers are in common use, but reporting different results for a check of the same document
- # [04:15] <MikeSmith> almost suggests that what we might end up needing is a spec that describes conformant behavior for conformance checkers ...
- # [04:27] <Zeros> MikeSmith, that was kind of Hixie's intent, that there is no single "correct" checker, just the spec.
- # [04:27] <Zeros> And then everyone gets to implement their own if they want one
- # [04:27] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
- # [04:28] <MikeSmith> Not that it matters, but I don't think Hixie's intent was that we end up with multiple conformance checkers reporting conflicting results for the same page
- # [04:29] <MikeSmith> I'm obviously not suggesting we have one "correct" checker
- # [04:29] <Zeros> As I understand it he wanted people to work together to improve each of the validators
- # [04:30] <MikeSmith> jesus
- # [04:30] <Zeros> If reporting the same results is all that matters then we might as well have a single validator.
- # [04:30] <MikeSmith> no, obviously not
- # [04:31] <Zeros> it ensures the validation results are always the same
- # [04:31] <Zeros> which is precisely what you want :)
- # [04:31] <MikeSmith> Zeros, no it's not, genius
- # [04:31] <Zeros> Wow, you get nasty fast.
- # [04:31] <MikeSmith> I do when you try to put words into my mouth
- # [04:32] <Zeros> you said "ensure that they report the same results"
- # [04:32] <Zeros> that was your words
- # [04:32] * Joins: gavin (gavin@74.103.208.221)
- # [04:34] <Zeros> heh, nice talking to you though
- # [04:35] * Quits: Zeros (Zeros-Elip@67.154.87.254) (Quit: Leaving)
- # [04:58] * Quits: heycam (cam@130.194.72.84) (Quit: bye)
- # [05:01] * Quits: Lionheart (robin@66.57.69.65) (Connection reset by peer)
- # [05:29] * Quits: dbaron (dbaron@63.245.220.242) (Quit: 8403864 bytes have been tenured, next gc will be global.)
- # [05:39] * Joins: Lionheart (robin@66.57.69.65)
- # [06:34] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
- # [06:39] * Joins: gavin (gavin@74.103.208.221)
- # [07:24] * Quits: olivier (ot@128.30.52.30) (Quit: This computer has gone to sleep)
- # [07:50] * Joins: olivier (ot@128.30.52.30)
- # [07:55] * Joins: heycam (cam@203.214.115.243)
- # [08:04] * Quits: MikeSmith (MikeSmith@mcclure.w3.org) (Client exited)
- # [08:14] * Joins: MikeSmith (MikeSmith@mcclure.w3.org)
- # [08:36] * Joins: NiColasS (nicolas@213.7.50.164)
- # [08:36] <NiColasS> I am using html from now on !
- # [08:40] <NiColasS> which doctype is more recommended ?
- # [08:41] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
- # [08:44] * Quits: NiColasS (nicolas@213.7.50.164) (Quit: NiColasS)
- # [08:46] * Joins: gavin (gavin@74.103.208.221)
- # [08:54] <hsivonen> jgraham: re test case format: "all the attributes must be given, in alphabetical order". Can we change that to lexicographically sorted by UTF-16 code unit?
- # [08:55] <hsivonen> I expect it already means that
- # [09:09] <hsivonen> MikeSmith: I'm really bad at estimating how much time a given piece of a conformance checker takes
- # [09:10] <hsivonen> MikeSmith: but sure, I can discuss stuff
- # [09:11] <MikeSmith> hsivonen - cool
- # [09:15] <MikeSmith> hsivonen - would it be accurate to say that your checker has a mechanism for making it possible to expose parse events for non well-formed XML as a stream of SAX events that XML tools can handle?
- # [09:15] <MikeSmith> (I don't have your thesis in front of me now, so going from memory...)
- # [09:16] <hsivonen> It has a mechanism for exposing HTML as SAX events and a mechanism for exposing XML as SAX events
- # [09:16] <hsivonen> I make no effort to fix bad stuff labeled as XML
- # [09:16] <MikeSmith> hsivonen - OK
- # [09:17] <hsivonen> and the HTML mechanism is being replaced
- # [09:17] <MikeSmith> anyway, I think that technique is particularly valuable
- # [09:17] <MikeSmith> replaced?
- # [09:17] <MikeSmith> you mean you will be rewriting the parsing algorithm?
- # [09:17] <MikeSmith> to match the spec?
- # [09:18] <hsivonen> yes.
- # [09:19] <hsivonen> Actually, instead of "will" it is "almost have"
- # [09:21] * Joins: tH (Rob@87.102.67.108)
- # [09:22] <MikeSmith> hsivonen - OK
- # [09:24] <MikeSmith> so I would hope we could take that technique for making it possible to handle non-WF HTML and use it in other languages as well
- # [09:25] <MikeSmith> for one thing, it would free validators from dependence on nsgmls
- # [09:26] <MikeSmith> or any dependence on SGML tools at all
- # [09:27] <hsivonen> what I'm writing now handles non-WF HTML
- # [09:28] <MikeSmith> hsivonen - understood. what I meant to say was that others could write other implementations using the same technique you developed
- # [09:29] <hsivonen> (I didn't develop the technique. I learned it from John Cowan.)
- # [09:30] <MikeSmith> ah
- # [09:31] <MikeSmith> I gotta admit I didn't pay much attention to TagSoup before
- # [09:31] <hsivonen> MikeSmith: my understanding is that Henry S. Thompson is working on a generic but vocabulary-specifically configurable schema-guided soup-XML parser using ideas from TagSoup
- # [09:32] <MikeSmith> but I remember your thesis saying that the TagSoup approach wasn't really suitable for a validation/conformance checker because its principal aim is just to "fix" (or whatever term) the source
- # [09:32] <MikeSmith> not to report errors in the source
- # [09:33] <hsivonen> MikeSmith: whereas Anne is working on one that is generic and not per-vocabulary configurable
- # [09:33] <MikeSmith> OK
- # [09:33] <hsivonen> MikeSmith: yeah, TagSoup is for apps that don't care about errors
- # [09:33] <hsivonen> MikeSmith: Petr Nálevka added error reporting to TagSoup, though
- # [09:34] <hsivonen> TagSoup doesn't conform to HTML 5, of course
- # [09:34] <hsivonen> it does its own thing
- # [09:35] * Quits: karl (karlcow@128.30.52.30) (Quit: Where dwelt Ymir, or wherein did he find sustenance?)
- # [09:36] * Quits: olivier (ot@128.30.52.30) (Quit: Leaving)
- # [09:44] * Joins: zcorpan (zcorpan@84.216.42.141)
- # [10:07] <jgraham> hsivonen: What's the difference between "alphabetic" and "lexicographically sorted"?
- # [10:17] <hsivonen> jgraham: case, for one
- # [10:17] <hsivonen> jgraham: also, it is unambiguous for non-a-to-z characters
- # [10:18] <jgraham> Does lexicographically sorted just mean sorted by code point index?
- # [10:18] <hsivonen> jgraham: yes.
- # [10:18] <jgraham> OK. That sounds sensible.
- # [10:18] <hsivonen> jgraham: except since we are both using UTF-16, we probably want to sort by code unit instead of code point
- # [10:18] <hsivonen> (matters for astral stuff)
- # [10:18] <jgraham> OK
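The difference between the three orderings discussed above can be shown with a short Python sketch. The attribute names are made-up examples, and `utf16_code_unit_key` is a hypothetical helper, not part of html5lib. Python compares strings by code point, so sorting by UTF-16 code unit needs an explicit key; encoding to big-endian UTF-16 works because the resulting bytes compare in the same order as the 16-bit units.

```python
def utf16_code_unit_key(name: str) -> bytes:
    """Sort key that orders strings by UTF-16 code unit (hypothetical helper)."""
    # Big-endian byte pairs compare exactly like the 16-bit code units they encode.
    return name.encode("utf-16-be")


# "B" vs "a" shows the case issue; U+FF5E vs U+10000 shows the astral issue.
names = ["a", "B", "\uff5e", "\U00010000"]

by_code_point = sorted(names)                          # Python's default ordering
by_code_unit = sorted(names, key=utf16_code_unit_key)  # the ordering proposed above

print([hex(ord(n)) for n in by_code_point])
print([hex(ord(n)) for n in by_code_unit])
```

U+10000 is stored in UTF-16 as the surrogate pair D800 DC00, so it sorts before U+FF5E by code unit but after it by code point. Uppercase "B" sorts before lowercase "a" in both, unlike a dictionary-style alphabetical order.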
- # [10:20] * Quits: spleen_blender (notgonnage@72.16.243.238) (Connection reset by peer)
- # [10:21] <hsivonen> edit the wiki
- # [10:22] * Quits: sbuluf (xtyh@200.49.140.181) (Ping timeout)
- # [10:25] <hsivonen> edited
- # [10:25] <hsivonen> that is, I edited
- # [10:26] <hsivonen> didn't mean to suggest that you edit it
- # [10:26] <hsivonen> anyway...
- # [10:27] <jgraham> Ah, I was just about to do it :)
- # [10:28] <jgraham> I actually think the description of the format there isn't very close to what we have implemented
- # [10:30] <hsivonen> jgraham: any major changes that I should be aware of?
- # [10:30] <jgraham> For example, you don't need the #errors section to follow the #data section
- # [10:30] <jgraham> Basically the implementation we have assumes:
- # [10:30] <jgraham> #data starts a new test
- # [10:31] <jgraham> There is a known list of subsections of test data which all start #something
- # [10:32] <hsivonen> eek. that's more complicated than absolutely necessary :-(
- # [10:32] <jgraham> hsivonen: I can make changes if you want
- # [10:33] <hsivonen> jgraham: I'd prefer the order of the subsections to be predictable. even better if all subsections were always there
- # [10:33] <jgraham> but a goal is to have the format slightly extensible so we can add extra (optional) sections to the tests like #innerHTML for the fragment case
- # [10:34] <hsivonen> but most of all, I'd prefer the sections to be considered to end with LF# instead of LF#foo
- # [10:35] <jgraham> Well I guess there's no problem saying "Any line that starts '#' is a new subsection"
- # [10:36] <hsivonen> good
- # [10:36] <jgraham> It just means you can't have test data with that string in
- # [10:36] <hsivonen> not much of a loss given that # isn't that interesting in test data
- # [10:37] <hsivonen> at the start of a line
- # [10:37] <jgraham> Indeed
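The rule agreed on above ("#data starts a new test", "any line that starts '#' is a new subsection") can be sketched as a small reader in Python. This is an illustration under assumptions, not the html5lib implementation: `parse_tests` is a hypothetical name, and the sample below uses only the #data and #errors sections named in the discussion, with made-up contents.

```python
def parse_tests(text: str) -> list:
    """Split test-file text into tests, each a dict of section name -> content."""
    tests = []
    current = None   # sections of the test being read
    section = None   # name of the section being read
    for line in text.split("\n"):
        if line.startswith("#"):
            # Any line starting with '#' opens a new subsection.
            section = line[1:]
            if section == "data":
                # '#data' additionally starts a new test.
                current = {}
                tests.append(current)
            if current is not None:
                current[section] = []
        elif current is not None:
            current[section].append(line)
    return [
        {name: "\n".join(lines) for name, lines in test.items()}
        for test in tests
    ]


sample = "#data\n<p>one\n<p>two\n#errors\n#data\nfoo\n#errors\nsome error"
tests = parse_tests(sample)
```

As noted in the discussion, the cost of this rule is that no line of test data may itself begin with '#'.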
- # [10:39] <hsivonen> Java API designers have a lot to learn from Python
- # [10:39] <hsivonen> 4 lines to instantiate an XML parser
- # [10:40] <hsivonen> another 4 to instantiate an XML serializer
- # [10:40] <hsivonen> going with the default
- # [10:48] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
- # [10:53] * Joins: gavin (gavin@74.103.208.221)
- # [11:01] * Joins: ROBOd (robod@86.34.246.154)
- # [11:06] * Quits: mjs (mjs@17.255.104.239) (Quit: mjs)
- # [11:27] <hsivonen> I now have something that runs and dumps a tree in the html5lib format
- # [11:27] <hsivonen> still a lot of known brokenness to fix
- # [11:30] * Quits: zcorpan (zcorpan@84.216.42.141) (Ping timeout)
- # [11:49] * Joins: zcorpan (zcorpan@84.216.42.141)
- # [11:57] <gsnedders> only 12 messages overnight on public-html… the traffic is really going down
- # [11:59] <MikeSmith> gsnedders - please don't talk about it. If you mention that, it'll jinx it and next thing you know we'll have a new thread about the indenting style of the source for the spec
- # [12:00] <gsnedders> MikeSmith: or worse — versioning.
- # [12:00] <MikeSmith> heh :9
- # [12:00] <MikeSmith> we need a FAQ really
- # [12:00] <MikeSmith> a preemptive FAQ
- # [12:01] <MikeSmith> "Please no versioning discussion ... please if you think that we should force authors to write only in valid XML, find another place to discuss it ... " etc.
- # [12:02] <gsnedders> if you want x, y, z see public-xhtml2
- # [12:03] <gsnedders> it's amazing how many people on public-html have a view that goes against what is in-scope for this WG, and is in-scope for XHTML2
- # [12:03] <MikeSmith> Or to put it another way, "These topics have been discussed at great length already, and it's not likely that whatever you might have to say about it is going to change the world and magically bring everything to a resolution"
- # [12:04] <MikeSmith> gsnedders - yeah
- # [12:04] <gsnedders> Or, to put it in another way again, "STFU."
- # [12:04] <MikeSmith> heh
- # [12:05] <MikeSmith> anyway, amazing to me that people don't see that you can author in whatever language you want and transform your content to HTML
- # [12:05] <MikeSmith> can create your own perfect authoring language that exactly meets whatever criteria you have
- # [12:06] <MikeSmith> and then try to convince others to use that for authoring if you want
- # [12:06] <MikeSmith> but it does not need to be directly supported in browsers
- # [12:06] <zcorpan> phew! document.title was insane
- # [12:07] <zcorpan> (and fun)
- # [12:08] * MikeSmith reads zcorpan message on document.title
- # [12:12] <gsnedders> zcorpan: what do you expect? it's HTML!
- # [12:13] <zcorpan> gsnedders: oh sure :)
- # [12:13] <MikeSmith> zcorpan - so current spec doesn't match behavior of any current browser?
- # [12:13] <zcorpan> MikeSmith: right
- # [12:14] <zcorpan> my proposal doesn't either, but is closer
- # [12:15] <MikeSmith> yeah
- # [12:15] <MikeSmith> it's great to be getting these detailed spec reviews posted to public-html
- # [12:16] <MikeSmith> I think a side effect of it will be to set higher expectations about what's appropriate for the list
- # [12:18] <zcorpan> yeah
- # [12:19] <zcorpan> and encourage others to do detailed reviews
- # [12:20] <MikeSmith> I think at some point we need to take a hard look at what has actually been accomplished by having a W3C working group participating in work on the HTML5 spec that might not have been accomplished by having the discussion take place only on the WHATWG list
- # [12:21] <MikeSmith> but not sure what we can point to so far as far as that goes
- # [12:21] <zcorpan> headers="" research
- # [12:22] <MikeSmith> OK, true that
- # [12:23] <MikeSmith> I think another anticipated benefit of having discussion within a W3C context was that it would facilitate and encourage participation from Microsoft ...
- # [12:23] <zcorpan> yeah
- # [12:25] <gsnedders> currently at 1212 tests for #numbers
- # [12:25] <gsnedders> still parts of the algorithm not tested, though
- # [12:25] <gsnedders> *algorithms
- # [12:26] <zcorpan> man, if we keep up this rate, we will have 20,000 tests at 2010 for sure
- # [12:27] <zcorpan> that doesn't mean we will have complete implementations though
- # [12:27] <gsnedders> zcorpan: what I'm doing for the numbers is a massive advantage for the numbers though: use each test input data for each number algorithm
- # [12:27] <gsnedders> (which massively increases the amount of invalid input, and checking of the error handling)
- # [12:30] <zcorpan> ok, document.body is next
- # [12:31] * gsnedders likes the fact that Bungie still cares about those who don't use Windows (insofar as they make sure their site works on Safari, they encode many videos in MPEG standards as well as WMV, etc.)
- # [12:34] <hsivonen> zcorpan: good stuff on the list
- # [12:39] <zcorpan> hsivonen: thanks
- # [12:46] * Joins: mjs (mjs@64.81.48.145)
- # [12:47] * Quits: MikeSmith (MikeSmith@mcclure.w3.org) (Quit: Less talk, more pimp walk.)
- # [12:47] * Joins: Sander (svl@80.60.87.115)
- # [12:47] * Quits: zcorpan (zcorpan@84.216.42.141) (Ping timeout)
- # [12:48] * Joins: zcorpan (zcorpan@84.216.42.141)
- # [12:51] * Joins: alexf (alejandro@85.152.42.1)
- # [12:51] * Quits: zcorpan (zcorpan@84.216.42.141) (Ping timeout)
- # [13:13] * Joins: myakura (myakura@58.88.37.26)
- # [13:18] <gsnedders> 20,000 tests probably won't be enough, actually :P
- # [13:19] * gsnedders moves on to lists of integers
- # [13:24] <Philip`> Are you manually verifying the output for each of these 1212 tests? :-)
- # [13:24] <gsnedders> no
- # [13:25] <gsnedders> each was written by hand, and not relying on any implementation, though
- # [13:25] <gsnedders> it's over 1500 now anyway :)
- # [13:32] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
- # [13:37] * Joins: gavin (gavin@74.103.208.221)
- # [13:41] * Philip` wonders why http://www.whatwg.org/specs/web-apps/current-work/multipage/section-entities.html sorts semicolons before end-of-strings
- # [13:42] <Philip`> (since that doesn't seem like a natural sorting order, and it's not an order that helps with the way I'm trying to implement it)
- # [13:53] * Joins: zcorpan (zcorpan@84.216.42.141)
- # [13:53] * Joins: karl (karlcow@128.30.52.30)
- # [13:57] * Joins: olivier (ot@128.30.52.30)
- # [14:05] * Quits: zcorpan (zcorpan@84.216.42.141) (Ping timeout)
- # [14:11] * Philip` gets down to two-and-a-half test failures
- # [14:21] <Philip`> Aha, only half a test failure now
- # [14:33] <Philip`> Oh, but I have an untested bug :-(
- # [14:34] <Philip`> but it looks like I'm not the only one
- # [14:35] * Quits: myakura (myakura@58.88.37.26) (Quit: Leaving...)
- # [14:36] * Quits: Sander (svl@80.60.87.115) (Quit: And back he spurred like a madman, shrieking a curse to the sky.)
- # [14:37] <Philip`> hsivonen: I believe <h a='&noti'> should return an attribute with value "¬i", but you give "\u????i" (for some value of ? that I don't know off the top of my head)
- # [14:39] <jmb> U+00AC, I'd expect
- # [14:42] <Philip`> Ah, that number sounds familiar
- # [14:42] * Philip` fixes that bug in his own code
- # [14:48] * Joins: zcorpan (zcorpan@84.216.42.141)
- # [15:02] * Quits: karl (karlcow@128.30.52.30) (Quit: This computer has gone to sleep)
- # [15:18] * Quits: olivier (ot@128.30.52.30) (Quit: Leaving)
- # [15:40] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
- # [15:45] * Quits: Lionheart (robin@66.57.69.65) (Connection reset by peer)
- # [15:45] * Joins: gavin (gavin@74.103.208.221)
- # [16:33] * Joins: billmason (billmason@69.30.57.156)
- # [16:45] <hsivonen> Philip`: does what you said about &noti still apply after "* Philip` fixes that bug in his own code"?
- # [16:49] <Philip`> hsivonen: Yes - I was handling that case incorrectly, and your code was doing it incorrectly too, and then I fixed my code, but yours is still incorrect
- # [16:50] <hsivonen> Philip`: ok. thanks. Is there a tokenizer-level test case about this in the html5lib repo?
- # [16:51] <Philip`> I added one to http://html5lib.googlecode.com/svn/trunk/testdata/tokenizer/test1.test ("Entity in attribute without semicolon ending in i")
- # [16:53] <hsivonen> Philip`: ok. thanks. I'll take a look tomorrow
- # [16:53] <Philip`> (Probably not the best description, since it's more relevant that it's almost but not quite 'notin', but I'm no good at describing these things concisely)
- # [16:54] <hsivonen> Philip`: btw, do you have a smarter implementation approach for it than what I have?
- # [16:56] <gsnedders> hmmm… "10" as a list of integers returns [1]
- # [16:58] <Philip`> I think my code is about the same as what you did (which is not coincidental) - it has a sorted array of entity names, then finds the range of names which match the first character (using a binary search, with STL doing all the hard work), then finds the subrange that match the second character, then repeats until the range has size zero/one
- # [16:58] <Philip`> (remembering any complete matches which it finds along the way)
- # [16:59] * Joins: spleen_blender (notgonnage@72.16.243.238)
- # [17:00] <Philip`> That seems to be generally sensible, since it never reads more characters than are required, and it doesn't waste loads of memory (e.g. on a trie)
- # [17:04] <Philip`> (http://canvex.lazyilluminati.com/svn/tokeniser/tokeniser.cpp at around where it says "entityNames")
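Editor's sketch of the lookup Philip` describes above: a sorted array of entity names, narrowed one character at a time with binary search, remembering any complete match seen along the way. This is not the code from tokeniser.cpp; the five-name table and the function name `longest_entity_prefix` are assumptions made for illustration, and the sketch assumes ASCII-only entity names.

```python
from bisect import bisect_left, bisect_right

# Tiny illustrative table (assumption); a real tokeniser holds every entity name.
ENTITY_NAMES = sorted(["amp", "lt", "not", "notin", "nu"])


def longest_entity_prefix(text):
    """Return the longest entity name that is a prefix of text, or None."""
    lo, hi = 0, len(ENTITY_NAMES)
    best = None
    for i in range(len(text)):
        prefix = text[: i + 1]
        # Narrow [lo, hi) to the names that start with the characters read so far.
        lo = bisect_left(ENTITY_NAMES, prefix, lo, hi)
        # Names are ASCII, so prefix + U+10FFFF sorts after every name with this prefix.
        hi = bisect_right(ENTITY_NAMES, prefix + "\U0010FFFF", lo, hi)
        if lo == hi:
            break  # no name continues this way, so stop reading characters
        if ENTITY_NAMES[lo] == prefix:
            best = prefix  # complete match; keep going in case a longer one exists
    return best
```

With this table, `longest_entity_prefix("noti")` remembers "not" even though "notin" is still in range when the input runs out, which is the shape of the "¬i" case discussed earlier.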
- # [17:05] <gsnedders> Actually implementing the algorithms and having > 1500 test cases is finally paying off
- # [17:18] <gsnedders> these algorithms have amazingly few bugs
- # [17:25] * Philip` creates 1011 tokeniser test cases
- # [17:25] <gsnedders> DanC: feeling brave saying you'll publish all three documents against objections? I don't want to hear the mailing list when you do.
- # [17:25] <Philip`> ...and I've found one bug, though I don't know which implementation is the buggy one
- # [17:25] <DanC> it's my job
- # [17:26] <gsnedders> DanC: heh. so many people will complain. probably end up people acting my age.
- # [17:27] <Philip`> Hmm, html5lib agrees with me
- # [17:27] <Philip`> Input: "<z/0 <"
- # [17:27] <Philip`> Question: How many parse errors?
- # [17:27] <gsnedders> is that an opening or closing tag?
- # [17:28] <gsnedders> the former, I assume?
- # [17:29] <Philip`> Oops, the difference is not just parse errors
- # [17:29] <Philip`> I get ["ParseError", "ParseError", ["StartTag", "z", {"0": "", "<": ""}]]
- # [17:30] <Philip`> hsivonen's says: ["ParseError","ParseError",["StartTag","z",{"0":""}],"ParseError",["Character","<"]]
- # [17:30] * Philip` tries to work out what's happening
- # [17:32] <Philip`> hsivonen: afterAttributeNameState does a "case '<':" but the spec doesn't say anything about handling < in that state
- # [17:38] <zcorpan> did before i think
- # [17:41] * Philip` makes another 2115 test cases
- # [17:41] <Philip`> (Am I winning yet?)
- # [17:42] <zcorpan> Philip`: are you shitting test cases? :)
- # [17:42] <Philip`> Bah, I didn't find any new bugs that time
- # [17:44] <hsivonen> zcorpan: my girlfriend tests water purifiers. she uses that kind of test material. ;-)
- # [17:44] <hsivonen> Philip`: < noted. will have a look tomorrow
- # [17:45] <zcorpan> hsivonen: :)
- # [17:47] * Philip` tries another 8145
- # [17:48] <Philip`> Whoops, I think zombie processes killed my test generator
- # [17:48] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
- # [17:48] <spleen_blender> lol, Z
- # [17:53] * Joins: gavin (gavin@74.103.208.221)
- # [17:56] <Philip`> I just find that <-in-attribute-name case lots of times, and no other visible bugs
- # [17:56] <Philip`> (Er, that should be "<-in-attribute" since it matters around attribute values too)
- # [17:58] <Philip`> It would be easier to test html5lib if its tokeniser output format hadn't totally changed and made every test fail
- # [18:05] <Philip`> ...and if svn didn't completely freeze solid whenever I tried accessing the html5lib repository
- # [18:10] * Joins: zcorpan_ (zcorpan@84.216.42.141)
- # [18:10] <Philip`> Ah, good, it's not my fault, it's just Google that's broken
- # [18:10] * Quits: zcorpan (zcorpan@84.216.42.141) (Ping timeout)
- # [18:54] * Quits: ROBOd (robod@86.34.246.154) (Quit: http://www.robodesign.ro )
- # [18:59] * Joins: ROBOd (robod@86.34.246.154)
- # [19:01] * Joins: dbaron (dbaron@63.245.220.242)
- # [19:07] * Parts: alexf (alejandro@85.152.42.1)
- # [19:12] * Joins: Lionheart (robin@198.86.248.1)
- # [19:16] * Joins: edas (edaspet@88.191.34.123)
- # [19:27] <gsnedders> hsivonen: can you try running "-a" through the list of integers algorithm?
- # [19:31] <gsnedders> hsivonen: actually, that's wrong. "-" is what I'm interested in.
- # [19:54] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
- # [19:59] * Joins: gavin (gavin@74.103.208.221)
- # [20:06] <zcorpan_> DanC: i might be here on irc during the telecon tomorrow, but in any case: i'm willing to help with test suite organization
- # [20:12] <DanC> ah. interesting.
- # [20:14] * Quits: xover (xover@193.157.66.5) (Ping timeout)
- # [20:15] <DanC> any ideas on how to organize tests, zcorpan_ ?
- # [20:16] <DanC> I'm interested in test materials that (a) aid developers in building good software, and (b) aid users in judging software and reporting problems
- # [20:17] <DanC> stuff that captures issues that people care about with objective results
- # [20:18] <DanC> the GRDDL spec has a much smaller scope, but we came up with a few dozen tests and we have test results in machine-readable form from a handful of implementations. http://www.w3.org/2001/sw/grddl-wg/td/test_results
- # [20:19] <DanC> I expect we'll need several different kinds of tests for HTML
- # [20:19] <DanC> we'll be able to automate some parts more than others, I expect
- # [20:25] * Joins: Sander (svl@80.60.87.115)
- # [20:27] * Quits: edas (edaspet@88.191.34.123) (Ping timeout)
- # [20:31] <gsnedders> too… much… spec…
- # [20:32] <zcorpan_> DanC: haven't thought much about it yet
- # [20:33] <zcorpan_> DanC: although we will have thousands of tests
- # [20:33] <zcorpan_> (or already have)
- # [20:34] <DanC> too much spec for what?
- # [20:34] <gsnedders> DanC: to review
- # [20:35] <zcorpan_> gsnedders: concentrate on one thing at a time :)
- # [20:35] <gsnedders> zcorpan_: I am
- # [20:35] <gsnedders> zcorpan_: It still seems endless, though
- # [20:36] <gsnedders> some day I'll misspell "microsyntaxes" in the subject line of one of my emails…
- # [20:36] * Joins: xover (xover@193.157.66.5)
- # [20:36] <DanC> I find it hard to believe that a BNF or regex isn't easier to specify and review than english prose for stuff like microsyntaxes. oh well.
- # [20:38] <gsnedders> DanC: a lot of the requirements would end up being English prose anyway
- # [20:38] <DanC> for example?
- # [20:38] <zcorpan_> DanC: perhaps we could just check in all tests at http://code.google.com/p/html5/
- # [20:38] <gsnedders> DanC: all the various times when you exist the ratios algorithm. you'd end up with so many alternatives in BNF or regex
- # [20:38] <gsnedders> *exit
- # [20:39] <DanC> what's a few more alternatives? this is a rather mature part of computer science.
- # [20:40] <gsnedders> DanC: take a look at #ratios, it isn't overly long, but I can't think of many easy ways of expressing that
- # [20:42] <DanC> I don't see anything that won't fit in a regex
- # [20:42] <gsnedders> It won't be an overly simple one though
- # [20:42] <DanC> so?
- # [20:42] <gsnedders> I'd rather have prose than complex regex
- # [20:42] <DanC> oh well.
- # [20:43] <DanC> you're doing the work, not me.
- # [20:44] <DanC> zcorpan_, hosting at code.google.com might work, as long as we can keep a copy in w3.org somewhere too. does code.google.com offer an rsync interface?
- # [20:44] <DanC> I'd rather use a decentralized version control system like hg or bzr or git
- # [20:44] <gsnedders> DanC: and anne said that I was the person to ask if he ever needed an overly complex regex :P
- # [20:48] <gsnedders> DanC: ^[^.0123456789]*([0123456789]+\.[0123456789]*|[0123456789]*\.[0123456789]+|[0123456789])(<unicode character class Zs>)*((%|٪|﹪|％|‰|‱)[^0123456789]*|[^.0123456789]*([0123456789]+\.[0123456789]*|[0123456789]*\.[0123456789]+|[0123456789])[^0123456789%٪﹪％‰‱]*)$
- # [20:48] <gsnedders> DanC: I think that expresses the algorithm…
- # [20:49] <DanC> written out as BNF, it's probably quite straightforward
- # [20:49] <DanC> since, for example, [0123456789] gets factored out as <digit>
- # [20:50] <gsnedders> *DIGIT "." +DIGIT / +DIGIT "." *DIGIT / DIGIT covers a floating point number in ABNF, I think
- # [20:50] <gsnedders> It'll be simpler than URI's ABNF for certain though
- # [20:51] <DanC> when you said "a lot of the requirements would end up being English prose anyway" I thought you were saying that there are constraints that can't be expressed in BNF. I don't see any so far.
- # [20:52] <gsnedders> ratios probably isn't the best of examples
- # [20:53] <DanC> I think it's worth publishing BNF for these things, even if it has to be a separate document. I've got a handful of volunteers for the formalization task.
- # [20:53] <gsnedders> dates would end up being verbose if you wanted to be exact (RFC3339's ABNF allows hours > 24, minutes > 59, seconds > 60)
- # [20:54] <gsnedders> I've got nothing against publishing some sort of BNF, but I'd rather the prose were the only normative part of the standard
- # [20:54] <DanC> true, capturing leap year rules in regex's isn't worthwhile. \d\d\d\d-\d\d-\d\d plus some prose constraints is a happy medium.
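Editor's sketch of the "happy medium" DanC suggests: a regex checks only the shape of the string, and the calendar rules that are awkward in a grammar (month lengths, leap years) are checked separately. The function name `is_valid_date` is hypothetical, and leaning on Python's `datetime.date` for the calendar rules is an assumption of this sketch, not something the spec prescribes; note that `date` rejects year 0000.

```python
import re
from datetime import date

# Shape only: four digits, two digits, two digits. re.ASCII keeps \d to 0-9.
DATE_SHAPE = re.compile(r"\d\d\d\d-\d\d-\d\d", re.ASCII)


def is_valid_date(text):
    """Check the shape with a regex, then the calendar constraints in code."""
    if DATE_SHAPE.fullmatch(text) is None:
        return False
    year, month, day = (int(part) for part in text.split("-"))
    try:
        date(year, month, day)  # raises ValueError for e.g. 30 February
    except ValueError:
        return False
    return True
```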
- # [20:55] <gsnedders> I liked BNF more before I started dealing with URIs and IRIs.
- # [20:55] <DanC> 3.2.3.4. Ratios doesn't motivate any of the complexity.
- # [20:56] <DanC> the regex at the end of the URI spec works much better than the BNF. URIs aren't parsed top-down like programming languages; they're chopped up piece by piece
- # [20:57] <gsnedders> they aren't that nice when you do try and parse them without using regex
- # [20:58] <gsnedders> </complete:understatement>
- # [20:58] <DanC> URI syntax is particularly horrible, and it took a long time to figure out the bounds of the standard. (it's still ongoing).
- # [20:58] <zcorpan_> DanC: don't know
- # [21:00] <DanC> I wonder where this %|٪|﹪|％|‰|‱ stuff came from. Surely no cows blazed any path like that.
- # [21:00] <gsnedders> DanC: it's allowed as content of elements
- # [21:01] <DanC> yes, but why bother with ‱ ? is that really worthwhile? why go beyond one % character?
- # [21:01] <gsnedders> hmm… URIs predate me :\ (though I think any standards regarding URL/URIs are younger)
- # [21:01] <gsnedders> DanC: the arabic one is really used by arabic people.
- # [21:03] <DanC> wild. I think it's best to return to my "I don't care what design you come up with, as long as there are plenty of tests and the implementors are willing to pass them" mode.
- # [21:03] * DanC needs lunch
- # [21:03] * DanC is late for a telcon :-/
- # [21:27] * Quits: Lionheart (robin@198.86.248.1) (Ping timeout)
- # [21:41] <Philip`> Alas, I don't find any interesting bugs in the Python html5lib with my ~8K tokeniser tests :-(
- # [21:43] * Joins: briansuda (briansuda@85.220.95.76)
- # [21:47] * gsnedders adds with great glee to his commit message: "This is enough to test every algorithm within #numbers in the revision we're testing."
- # [21:47] <gsnedders> 1890 tests (inc. 315 ignored — the percentages and dimensions section which in the spec is TBW)
- # [21:49] <Philip`> Oops, they are actually interesting bugs
- # [21:50] * Philip` looks at them
- # [21:52] <Philip`> "<!doctype html \u000D"
- # [21:53] <Philip`> "<z \u000D"
- # [21:53] * Philip` sees a pattern
- # [21:53] <gsnedders> what is U+000D? CR?
- # [21:54] <gsnedders> if so, are you just testing the tokeniser, or the input stream as well (as they are dealt with in there)?
- # [21:54] <Philip`> Yep, CR
- # [21:55] <Philip`> This includes the input stream (since everyone seems to implement that as kind of part of the tokeniser)
- # [21:55] <Philip`> and that CR isn't followed by an LF, so an LF should be emitted
- # [21:55] <hsivonen> Philip`: zapped the < case in after attribute name. looks like I missed when Hixie zapped it from the spec.
- # [21:55] <hsivonen> Philip`: thanks
- # [21:56] <hsivonen> s/missed/missed it/
- # [21:56] <Philip`> and that LF is whitespace and gets skipped over, until the EOF is hit
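Editor's sketch of the newline handling Philip` is describing: a CR followed by an LF collapses to a single LF, and a CR on its own becomes an LF, so "<z \u000D" reaches the tokeniser as "<z " followed by LF. The function name `normalise_newlines` is hypothetical, and a real input stream would do this incrementally rather than on a whole string.

```python
def normalise_newlines(text):
    """Turn CR LF pairs and lone CRs into single LFs."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\r":
            out.append("\n")
            # Swallow the LF of a CR LF pair so it is not emitted twice.
            if i + 1 < len(text) and text[i + 1] == "\n":
                i += 1
        else:
            out.append(ch)
        i += 1
    return "".join(out)
```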
- # [21:57] <hsivonen> gsnedders: Do I have an implementation of list of integers somewhere?
- # [21:57] <hsivonen> gsnedders: If my memory serves me correctly, I used a big regexp--not the algorithm
- # [21:58] <hsivonen> gsnedders: if the algorithm is for stuff like area coordinates
- # [21:58] <gsnedders> hsivonen: I was making an assumption that you did somewhere in the conformance checker
- # [21:58] * hsivonen looks at the spec
- # [21:58] <gsnedders> but yes, things like @coords
- # [21:59] <hsivonen> it has been a while since I have touched those parts
- # [21:59] <gsnedders> hsivonen: I'm just checking I haven't gone wrong somewhere. It's the one issue I've found I'm least sure about.
- # [21:59] <hsivonen> gsnedders: OK. I didn't implement the algorithm. I just took a hard look at it and wrote a regexp that is supposed to accept the same strings
- # [22:01] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
- # [22:01] <hsivonen> gsnedders: http://syntax.whattf.org/relaxng/embed.rnc
- # [22:02] <gsnedders> hsivonen: so all you do is check for errors, therefore don't have a result
- # [22:02] <hsivonen> gsnedders: yeah
- # [22:02] <hsivonen> gsnedders: I did implement the ratio algorithm, though
- # [22:03] <hsivonen> Philip`: It seems I forgot to make separate entity tables for attributes
- # [22:03] * Quits: Sander (svl@80.60.87.115) (Quit: And back he spurred like a madman, shrieking a curse to the sky.)
- # [22:07] <hsivonen> doh. I no separate tables needed
- # [22:07] <hsivonen> s/I//
- # [22:07] <hsivonen> I keep forgetting what code I have written and why
- # [22:07] * Joins: gavin (gavin@74.103.208.221)
- # [22:09] * Joins: Lionheart (robin@66.57.69.65)
- # [22:10] <Philip`> hsivonen: "<z/0=<"
- # [22:10] <Philip`> or I guess "<z x=<" except I haven't tested that particular case
- # [22:10] <Philip`> results in the tag being closed, instead of < in the attribute value
- # [22:11] <Philip`> hsivonen: The problem I had with attribute entities is that I was checking the last consumed character, instead of the character after the longest entity match, so I just had to fix that to examine the correct character
- # [22:13] <spleen_blender> lol hsiv, story of my life
- # [22:14] <zcorpan_> gsnedders: the TBW markers are out of sync
- # [22:14] * Philip` needs to find a way to run tests in his tokeniser without starting a new process for every test case
- # [22:16] <hsivonen> Philip`: do you mean I still have a "<" bug to fix?
- # [22:16] <Philip`> hsivonen: Yes - < in attribute-value-state
- # [22:17] <Philip`> Uh
- # [22:17] <Philip`> Before-attribute-value-state?
- # [22:17] <Philip`> Something like that
- # [22:17] <Philip`> Ah, yes, it is that
- # [22:18] <hsivonen> Philip`: < fix checked in for before attr val
- # [22:18] <hsivonen> Philip`: thank you
- # [22:23] <Philip`> hsivonen: I can't find any more bugs now :-(
- # [22:24] <hsivonen> Philip`: nice
- # [22:25] <Philip`> Now I just have to wait until the spec changes and spawns a new set of bugs
- # [22:25] <hsivonen> Philip`: well, my charset sniffing is not up to date
- # [22:26] <hsivonen> Philip`: and I won't fix it for a while
- # [22:27] <zcorpan_> Philip`: i think the html5lib tests are a step before the spec. the <title><!--&--></title> case
- # [22:28] <hsivonen> hmm. looks like I fail two encoding tests now
- # [22:30] <Philip`> Oh, maybe I should look at non-PCDATA at some point
- # [22:33] <hsivonen> oops. assertions fail when I remember to turn them on...
- # [22:40] <hsivonen> bah. my assertion was on the wrong line
- # [22:42] <jgraham> Philip`: Did you say you found a html5lib bug?
- # [22:42] <Philip`> jgraham: See http://html5lib.googlecode.com/svn/trunk/testdata/tokenizer/test4.test as of about two seconds ago
- # [22:43] <Philip`> Quite a few of those fail in html5lib for various reasons
- # [22:43] <Philip`> Like...
- # [22:43] <Philip`> unusual characters after a CR
- # [22:44] <Philip`> (Er, wait a minute, just trying to remember)
- # [22:44] <jgraham> So mostly input stream related?
- # [22:45] <Philip`> ...and non-BMP characters, though that's possibly just an issue with the JSON handler (since JSON is meant to do \u????\u???? surrogate pairs)
- # [22:46] <Philip`> ...and uppercase/lowercase tag/attribute names (though I saw you said you had a patch for that already)
- # [22:46] <Philip`> ...and the number of parse errors when an attribute is triplicated(?) instead of just duplicated
- # [22:47] <jgraham> I see 8 failures
- # [22:47] <Philip`> ...and attributes on end tags
- # [22:47] <Philip`> ...and I think that's all
- # [22:47] <Hixie> gsnedders: the problem with BNF or regexp is that they don't explain the error handling properly, usually. BNF could work for defining the author requirements in some cases, i guess, though i'm not convinced that would be better than prose, and, more importantly, once you have a BNF people are way too tempted to use it to define the parsing.
- # [22:47] <Hixie> DanC: see above also
- # [22:48] <Philip`> jgraham: Do you have local modifications? (SVN seems to have totally broken tokeniser-testing at the moment, so I assume you're not just using that)
- # [22:48] <gsnedders> Hixie: in the case of the common microsyntaxes the error handling is normally rather consistent though, and could be put simply
- # [22:48] <jgraham> Philip`: I'm using svn (my local modifications shouldn't affect this at all)
- # [22:48] <jgraham> Do you have simplejson installed?
- # [22:49] * DanC tunes in...
- # [22:49] <Philip`> jgraham: Also I put "ignoreErrorOrder":true on some tests where the error order is undefined and the test code should ignore differences
- # [22:49] <DanC> oh... BNF. never mind. whatever is convenient for the editor and reviewers is fine by me.
- # [22:49] <Philip`> (because the errors are emitted by the input stream, and nothing says when that actually occurs in relation to the token stream)
- # [22:50] <Philip`> I'm not sure if there's a better way to handle those cases
- # [22:50] <Philip`> (If it seems sensible, I can try to add support for ignoreErrorOrder into html5lib)
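Editor's sketch of one way a test runner could honour "ignoreErrorOrder": compare the non-error tokens in order, and compare only the number of "ParseError" entries. The function name `tokens_match` and its signature are assumptions; this is not html5lib's actual test harness.

```python
def tokens_match(expected, received, ignore_error_order=False):
    """Compare two token lists in the tokeniser test output format."""
    if not ignore_error_order:
        return expected == received

    def split(tokens):
        # Separate the count of parse errors from the ordered real tokens.
        errors = sum(1 for token in tokens if token == "ParseError")
        others = [token for token in tokens if token != "ParseError"]
        return errors, others

    return split(expected) == split(received)
```

For example, with the flag set, `["ParseError", ["Character", "\n\ufffd"]]` and `[["Character", "\n\ufffd"], "ParseError"]` compare equal; without it they do not.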
- # [22:51] <DanC> Hixie, are you still on holiday?
- # [22:51] <jgraham> What can you do then except count errors?
- # [22:51] <hsivonen> Philip`: I put the semicolon check in the wrong place...
- # [22:52] <Philip`> jgraham: About simplejson: I do have that installed, and html5lib appears to be importing it successfully
- # [22:52] <jgraham> Which version of simple json and which of python?
- # [22:53] * jgraham has simplejson 1.7.1 and python 2.5
- # [22:53] <hsivonen> it probably makes a difference whether you've got UTF-16 Python (OS X) or UTF-32 Python (Debian)
- # [22:54] <Hixie> DanC: yup
- # [22:54] <hsivonen> (making programs change meaning depending on how the interpreter was compiled is extremely bad idea, but that's the way Python is)
- # [22:55] <Hixie> DanC: 2 and a half more weeks, just checking in to keep the e-mail under control
- # [22:55] <hsivonen> s/extremely/an extremely/
- # [22:55] <DanC> ok. enjoy.
- # [22:55] <Philip`> jgraham: Counting errors and checking that the output characters are correct is still useful, e.g. I see "\r\u0000" being no parse error and "\n\u0000" (when it should have one parse error, but it doesn't matter whether it's before or between the characters)
- # [22:56] <Philip`> (*it should have one parse error and "\n\uFFFD")
- # [22:56] <jgraham> I guess
- # [22:56] <Philip`> s/being/being parsed by html5lib into/
- # [22:56] <hsivonen> should the JSON root name be different when the parse error semantics differ?
- # [22:57] <hsivonen> testsWithCountedErrors or somesuch
- # [22:57] <Philip`> Hmm, maybe I don't have simplejson
- # [22:57] * Quits: gsnedders (gsnedders@81.132.88.104) (Quit: gsnedders)
- # [22:58] <Philip`> Oh, yes I do
- # [22:58] <Philip`> version 1.7.1
- # [22:58] <hsivonen> Philip`: are you on debian?
- # [22:59] <Philip`> and Python 2.5.1
- # [22:59] <Philip`> Gentoo
- # [22:59] <Philip`> compiled without the "ucs2" option
- # [22:59] <Philip`> (That is, Python compiled without the "ucs2" option)
- # [22:59] <hsivonen> Philip`: that may be the problem right there
- # [23:00] <Philip`> For the cases like
- # [23:00] <Philip`> Expected:
- # [23:00] <Philip`> [[u'Character', u'\ud800\udc00']]
- # [23:00] <Philip`> Recieved:
- # [23:00] <Philip`> [[u'Character', u'\U00010000']]
- # [23:00] <Philip`> ?
- # [23:00] <hsivonen> Philip`: yes
- # [23:00] <Philip`> Sounds quite plausible
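Editor's sketch of why the two values above differ only in representation: U+10000 written as UTF-16 code units is the surrogate pair D800 DC00. The arithmetic is the standard UTF-16 encoding; the function name `to_surrogate_pair` is hypothetical.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP have surrogate pairs")
    offset = code_point - 0x10000      # 20 bits to distribute
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low
```

A test runner that wants the same answer on every Python build could normalise both the expected and the received strings to one of the two forms before comparing.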
- # [23:01] * jgraham has Ubuntu which seems to have UCS4 python
- # [23:01] * Quits: ROBOd (robod@86.34.246.154) (Quit: http://www.robodesign.ro )
- # [23:02] * Joins: gsnedders (gsnedders@81.132.88.104)
- # [23:02] <Philip`> My C++ tokeniser will break unpleasantly under Windows because wchar_t is 2 bytes there, but I've just ignored that for now
- # [23:02] <hsivonen> this issue is the biggest Python WTF in my book
- # [23:03] * jgraham doesn't understand the issues well enough to have a useful opinion
- # [23:03] * Quits: gsnedders (gsnedders@81.132.88.104) (Quit: Don't touch /dev/null…)
- # [23:03] * Joins: gsnedders (gsnedders@81.132.88.104)
- # [23:03] <Philip`> At least I can just use std::basic_string<int32_t> and copy-and-paste some character-trait magic and then it should work with no other changes to my code
- # [23:03] <jgraham> But it is horrible that it's different on different installations
- # [23:06] <hsivonen> if I were doing general-purpose C++, I'd use either UTF-16 internally and ICU or UTF-8 and glib
- # [23:06] * Quits: briansuda (briansuda@85.220.95.76) (Quit: briansuda)
- # [23:06] * Joins: myakura (myakura@58.88.37.26)
- # [23:06] * Quits: myakura (myakura@58.88.37.26) (Quit: Leaving...)
- # [23:07] * hsivonen doesn't trust standard C++ lib strings
- # [23:09] <Philip`> Hmm, I suppose UTF-32 might be a bad idea if my code was actually doing something useful, instead of being purely streaming and never storing strings in memory for more than a few microseconds
- # [23:10] <Philip`> I've not seen STL strings doing anything other than act like an array of characters, so that doesn't seem to be a problem
- # [23:10] <Philip`> (though maybe I'm missing some issues somewhere)
- # [23:10] <mjs> WebKit uses UTF-16 internally throughout, since the DOM APIs are defined in terms of UTF-16
- # [23:11] <hsivonen> Philip`: more to the point, I don't trust wchar_t
- # [23:11] <Philip`> wchar_t is just an integer, of almost totally undefined size :-)
- # [23:11] <Philip`> which I suppose makes it not incredibly portable
- # [23:12] <hsivonen> mjs: your XML parser uses UTF-8 internally, right?
- # [23:12] <Philip`> but std::basic_string<uint16_t> should do the same everywhere
- # [23:12] <hsivonen> mjs: so there's a conversion every time?
- # [23:12] <hsivonen> Philip`: ok
- # [23:13] <mjs> hsivonen: yeah, for libxml it converts both ways every time
- # [23:13] <hsivonen> there seems to be a tendency that Microsoft, Apple, IBM, Mozilla and Sun like UTF-16 and Gnome likes UTF-8
- # [23:14] <hsivonen> UTF-16 is more corporate than UTF-8 :-)
- # [23:15] <mjs> UTF-16 is kind of sad
- # [23:15] <mjs> because it doesn't have the nice properties of either UTF-8 or UTF-32
- # [23:15] <hsivonen> mjs: yet, Debian/Ubuntu/Gentoo Python being not sad is more trouble than being consistently sad
- # [23:16] <Philip`> Has anyone done a UTF-21?
- # [23:16] <mjs> well, Python making the character set a compile time option is pretty ridiculous
- # [23:16] <Philip`> You'd get better space efficiency than UTF-32, and constant-time seeking to an arbitrary point in the string
- # [23:17] <Philip`> If we had 7-bit processors it'd even be nearly not stupid
- # [23:18] <hsivonen> Philip`: if you write an RFC, we'll have yet another encoding only useful for test cases as far as interchange over HTTP goes
- # [23:18] <mjs> UTF-24 would be slightly less silly
- # [23:18] <mjs> but still annoying, since unaligned access is expensive on most modern CPUs
- # [23:19] * Quits: gsnedders (gsnedders@81.132.88.104) (Quit: gsnedders)
- # [23:21] * Quits: mjs (mjs@64.81.48.145) (Quit: mjs)
- # [23:25] <DanC> is UTF-32 different from UCS-4 in any way?
- # [23:30] <hsivonen> DanC: I think there's a theoretical difference of max scalar value stored in the code unit
- # [23:31] <hsivonen> plus UTF-32 on disk or network is well-defined while UCS4 isn't, IIRC
- # [23:31] <DanC> hmm
- # [23:35] <hsivonen> DanC: but the practical difference is that UTF-32 is contemporary terminology while UCS-4 is old terminology. :-)
- # [23:35] * Joins: gsnedders (gsnedders@81.132.88.104)
- # [23:35] <DanC> ok. thanks.
- # [23:53] * Joins: hyatt (hyatt@17.203.14.212)
- # [23:57] * Quits: hyatt (hyatt@17.203.14.212) (Quit: hyatt)
- # [23:59] * Joins: hyatt (hyatt@17.203.14.212)
- # Session Close: Thu Jul 12 00:00:00 2007