  1. # Session Start: Wed Jul 11 00:00:00 2007
  2. # Session Ident: #html-wg
  3. # [00:00] <zcorpan> would be cool
  4. # [00:00] <Philip`> It'd probably be pretty much identical to the C++ version, except for replacing 'bool' with 'var' and rewriting all the support code around the edges
  5. # [00:00] <zcorpan> although tree construction with the dom core apis won't work in some cases :|
  6. # [00:03] <hsivonen> zcorpan: other than doctype?
  7. # [00:04] <zcorpan> yeah. attributes that start with =. element names that contain &. etc
  8. # [00:05] <zcorpan> raise exceptions if you try to create them
  9. # [00:13] * Quits: tH (Rob@87.102.67.108) (Quit: ChatZilla 0.9.78.1-rdmsoft [XULRunner 1.8.0.9/2006120508])
  10. # [00:13] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
  11. # [00:14] <hsivonen> http://software.hixie.ch/utilities/js/live-dom-viewer/?%3C%21DOCTYPE%20html%3E%0A%3Ctable%3E%0A%3Ctr%3E%3Ctd%3ECell%3C/td%3E%3C/tr%3E%0Afoo%3C%21--%20--%3E%20%3C%21--%20--%3Ebar%0A%3Ctr%3E%3Ctd%3ECell%3C/td%3E%3C/tr%3E%0A%3C/table%3E%0A%3Ctable%3E%0A%3Ctr%3E%3Ctd%3ECell%3C/td%3E%3C/tr%3E%0A%3C%21--%20--%3E%3C%21--%20--%3E%0A%3Ctr%3E%3Ctd%3ECell%3C/td%3E%3C/tr%3E%0A%3C/table%3E%0A
  12. # [00:14] <hsivonen> Check out Gecko.
  13. # [00:15] <hsivonen> Safari is closer to spec.
  14. # [00:16] <hsivonen> Safari seems to do what I proposed on list
  15. # [00:18] <zcorpan> iirc there was another proposal to use a flag. when you hit something that causes foster reparenting the flag is set to true, and then whitespace and comments are also fosterparented. the flag is set to false when you hit a table-related element again
  16. # [00:18] <zcorpan> or some such
  17. # [00:18] <zcorpan> i.e. what firefox does afaict
  18. # [00:18] * Joins: gavin (gavin@74.103.208.221)
  19. # [00:19] <zcorpan> does safari drop text nodes with only whitespace between comments?
  20. # [00:20] <zcorpan> only in table
  21. # [00:20] * Quits: hyatt (hyatt@17.203.14.191) (Quit: hyatt)
  22. # [00:20] <hsivonen> zcorpan: I was testing WebKit trunk, actually
  23. # [00:21] <hsivonen> zcorpan: release Safari doesn't put comments in the DOM
  24. # [00:22] <zcorpan> hsivonen: 3 beta for windows does
  25. # [00:22] <zcorpan> hsivonen: but i see that trunk doesn't drop the text node
  26. # [00:25] * Joins: hyatt (hyatt@17.203.14.191)
  27. # [00:27] <Philip`> Hooray, JS tokeniser works
  28. # [00:27] <Philip`> ...for a rather limited set of inputs
  29. # [00:29] <Philip`> http://canvex.lazyilluminati.com/misc/parser/tokeniser_js.html
  30. # [00:29] <Philip`> Probably only works in Firefox because I used uneval since I didn't want to actually put any effort into it
  31. # [00:29] <Philip`> but at least it handles tags and attributes alright
  32. # [00:30] <zcorpan> Philip`: nice!
  33. # [00:32] <Philip`> If anyone wants to make it work decently, please feel free :-)
  34. # [00:34] * gsnedders is tempted to ask what the point of that is
  35. # [00:36] <Philip`> It could (if it had the rest of the parser) be like http://james.html5.org/parsetree.html except without needing any server-side code
  36. # [00:36] <Philip`> I'm still not sure what the point of that would be, though
  37. # [00:37] <Philip`> But HTML can't be considered a complete platform for application development until you can write a whole web browser in it, so an HTML parser is highly useful for that
  38. # [00:37] <gsnedders> If I wanted an HTML parser in an HTML document, why not just use the browser's own parser
  39. # [00:37] <zcorpan> comparing the browser's tree with the spec
  40. # [00:37] <zcorpan> to find bugs in the spec
  41. # [00:37] <Philip`> It could provide the solution to backward-compatibility problems!
  42. # [00:38] <Philip`> Instead of <!doctype html>, just get people to use <script src=http://w3.org/2009/html5></script> as the magic line at the top of their file
  43. # [00:39] <Philip`> HTML5 UAs can detect that and remove it, while all others will execute the script
  44. # [00:39] <zcorpan> -_-
  45. # [00:39] <gsnedders> 2009? feeling optimistic? :P
  46. # [00:39] <Philip`> which can read in the rest of the document content, then use the JS HTML parser to construct the DOM
  47. # [00:40] <jgraham> Thus creating perfect interoperability and *really* slow sites
  48. # [00:40] <Philip`> It's a flawless plan
  49. # [00:40] <jgraham> :)
  50. # [00:40] <Philip`> That'll just encourage users to upgrade their browsers
  51. # [00:40] <jgraham> Good point
  52. # [00:40] <zcorpan> or leave your site
  53. # [00:41] <jgraham> Does Safari really pick up a charset attribute on the html element? AFAICT Opera and Firefox don't
  54. # [00:42] <gsnedders> doesn't appear to
  55. # [00:42] <gsnedders> (saf 3 beta os x)
  56. # [00:43] <zcorpan> doesn't per my testing either
  57. # [00:43] * jgraham hypothesises that Robert Burns's text editor added a BOM or something
  58. # [00:43] <gsnedders> anyhow, g'nite (4realz)
  59. # [00:44] <jgraham> goodnight
  60. # [00:44] <zcorpan> nn
  61. # [00:44] <gsnedders> (yes, I had to throw a "4realz" in)
  62. # [00:44] <gsnedders> back to spec reviewing tomorrow (yay! :\)
  63. # [00:44] * zcorpan too
  64. # [00:45] * gsnedders waits to be asked next term, "What did you do over the summer holidays?"
  65. # [00:45] <gsnedders> Why, review HTML 5, of course!
  66. # [00:45] <jgraham> I thought you were going to bed! ;)
  67. # [00:45] * zcorpan too
  68. # [00:45] * jgraham should sleep soon as well
  69. # [00:46] <gsnedders> jgraham: 4realz. :D
  70. # [00:48] * Parts: hasather (hasather@80.203.71.22)
  71. # [00:50] * Quits: heycam (cam@203.214.115.243) (Ping timeout)
  72. # [00:52] * zcorpan added http://simon.html5.org/test/html/parsing/encoding/002.htm
  73. # [00:53] <Philip`> I can't work out how to make my browsers stop treating every document as UTF-8, regardless of what meta-charset or actual characters they have in them...
  74. # [00:54] <zcorpan> Philip`: opera?
  75. # [00:55] * zcorpan finds it interesting that Robert says he doesn't know whether or not research has been made when he has been told that research has been made, been pointed to the relevant test cases, and to the relevant part of the spec
  76. # [00:56] <Philip`> Opera/FF/IE/Safari
  77. # [00:56] <Philip`> Maybe I configured my web server to do something...
  78. # [00:57] <Philip`> Oh, yes, that would explain it
  79. # [00:57] <zcorpan> AddDefaultCharset utf-8 ? :)
  80. # [00:59] <Philip`> Yes :-(
  81. # [00:59] <Philip`> (Well, actually, AddCharset UTF-8 .html)
  82. # [01:00] <zcorpan> ok
  83. # [01:00] * Philip` renames his file to .htm
  84. # [01:01] * Joins: MikeSmith (MikeSmith@mcclure.w3.org)
  85. # [01:02] <zcorpan> nn
  86. # [01:02] * Quits: hyatt (hyatt@17.203.14.191) (Quit: hyatt)
  87. # [01:02] * Parts: zcorpan (zcorpan@84.216.41.183)
  88. # [01:05] <Philip`> When I'm not doing anything stupid, I also agree that Safari 3 on Windows (and FF2 and IE7) does ignore <html charset> but respect <meta charset>
  89. # [01:08] * Parts: billmason (billmason@69.30.57.156)
  90. # [01:34] * Joins: sbuluf (xtyh@200.49.140.181)
  91. # [01:36] * Joins: heycam (cam@130.194.72.84)
  92. # [02:10] * Quits: MikeSmith (MikeSmith@mcclure.w3.org) (Ping timeout)
  93. # [02:16] * Joins: karl (karlcow@128.30.52.30)
  94. # [02:20] * Joins: Lionheart (robin@66.57.69.65)
  95. # [02:20] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
  96. # [02:24] * Quits: Zeros (Zeros-Elip@67.154.87.254) (Quit: Leaving)
  97. # [02:25] * Joins: gavin (gavin@74.103.208.221)
  98. # [02:36] * Joins: MikeSmith (MikeSmith@mcclure.w3.org)
  99. # [02:42] <karl> http://en.wikipedia.org/wiki/Usage_share_of_web_browsers
  100. # [02:45] <MikeSmith> karl - I look forward to seeing how these publishers of browser market-share data handle browsers running on devices other than PCs
  101. # [02:46] <karl> yep me too
  102. # [02:46] <MikeSmith> we will see the day when the numbers of people browsing from desktop PC is eclipsed by those browsing from other devices
  103. # [02:46] <karl> I wonder if Safari iPhone has a different user agent for example
  104. # [02:46] <MikeSmith> I would think it does
  105. # [02:47] <karl> MikeSmith: it is the case somewhere… I have read something about this recently
  106. # [02:47] <mjs> versions are different, basic stuff is the same
  107. # [02:47] <karl> damn I don't remember where
  108. # [02:47] <karl> good evening, mjs
  109. # [02:47] <MikeSmith> mjs - I know I owe you a follow-up on the accesskey thread
  110. # [02:48] <MikeSmith> and cheers for you guys adding node-set() support to Webkit
  111. # [02:48] <karl> http://economictimes.indiatimes.com/Indians_prefer_to_surf_Net_on_the_go/articleshow/2183516.cms
  112. # [02:48] <karl> Indians prefer to surf Net on the go
  113. # [02:49] <MikeSmith> despite whatever others may think negatively about client-side XSLT, a lot of developers like it -- having another option
  114. # [02:49] <karl> "The number of Indians accessing internet through their mobile phones is now over three times those using the PC to connect to the Web. India has 9.27 million internet subscribers as against 31.30 million users who access internet through their mobile handsets—GSM or CDMA—to read and reply to mails, download content and for online transactions, according to latest figures released by telecom regulator Trai. "
  115. # [02:49] <MikeSmith> karl - haven't read that article but I would guess that's because many don't have Net access at home
  116. # [02:49] <MikeSmith> but instead many go to Net cafe and such
  117. # [02:50] <karl> yes it is hard to know real stats on that.
  118. # [02:50] <karl> There is an infrastructure issue too.
  119. # [02:50] <mjs> MikeSmith: I think the whole discussion needs to be restarted with a clear statement of what problems it's trying to solve, and how it will avoid the pitfalls of accesskey so far
  120. # [02:50] <mjs> MikeSmith: I think the desktop implementations of it so far are laughably bad and the lack of use in desktop content reflects that
  121. # [02:51] <karl> Mobile has developed a lot in Africa, because it is easier to distribute than having to rely on cables. More flexible.
  122. # [02:51] <MikeSmith> mjs - yeah, I pretty much agree with that. I don't personally see such a compelling use case for accesskey on desktop
  123. # [02:51] <mjs> (it seems clear to me that if tapping control and then hitting F does something totally different than hitting either F or control-F, that's going to be a usability problem)
  124. # [02:52] <MikeSmith> I think this is a good example of importance of considering carefully the consequences of anything new we spec
  125. # [02:52] <MikeSmith> because in the end, after it gets deployed, we have to live with it
  126. # [02:53] <MikeSmith> I would vote for trying to spec accesskey based on how it is most often currently used in the wild
  127. # [02:54] <MikeSmith> not on what anybody hopes or wants it to be used for
  128. # [02:59] <karl> interesting benchmarks - http://krijnhoetmer.nl/irc-logs/html-wg/20070710#l-239
  129. # [03:00] * MikeSmith is trying to remember Opera's desktop numbers and wonders that users of Wii browsers will eventually account for a quite large percentage of Opera user base worldwide
  130. # [03:00] <Philip`> I would test the Ruby one too if I even vaguely knew how to write Ruby :-)
  131. # [03:02] <MikeSmith> Ruby is supposed to be so easy and wonderful to program in that you don't really need to know how to write it. it just happens, like magic
  132. # [03:05] <MikeSmith> I do like writing in Ruby, actually. When I was at a previous employer I used Ruby to write a prototype for a Web app for doing some xml-rpc interaction with an engine that indexed user e-mail message stores
  133. # [03:05] <karl> MikeSmith: certainly in the mines of King Solomon
  134. # [03:05] <Philip`> I've already written C++, Java, JavaScript, Python, Perl and OCaml today, and now it's 2am, so I think my brain will explode if I look at yet another language using the same symbols in different ways :-(
  135. # [03:05] <MikeSmith> heh
  136. # [03:06] <Philip`> and yet another way to find the length of a list
  137. # [03:06] * Philip` wonders why no two languages ever seem to do that in the same way
  138. # [03:07] <MikeSmith> Philip` - I've heard you mention OCaml before but don't know much about it ... what's it good for?
  139. # [03:07] <Philip`> It's good for making one's brain explode
  140. # [03:07] <Philip`> or at least it takes a bit of getting used to
  141. # [03:08] <Philip`> (but I had to learn SML at university a while ago, and OCaml uses mainly the same concepts)
  142. # [03:10] <Philip`> It seems to be quite useful for manipulating complex data structures - I have something like http://canvex.lazyilluminati.com/svn/tokeniser/cpp.ml to create trees of C++ code and then print them prettily, and the functions just do pattern-matching to respond to the appropriate types
  143. # [03:12] <MikeSmith> Philip` - I se
  144. # [03:12] <MikeSmith> see
  145. # [03:13] <Philip`> It's also a (mostly) functional language, so I can have an implementation of the tokeniser that is side-effect-free so I can easily tell that it's not messing with things it shouldn't be messing with
  146. # [03:14] <Philip`> and the tokeniser is just based around a function which takes a state value and returns the next state value, which makes it easy to e.g. fork the character stream and see how it responds differently to different inputs
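
The state-in, state-out design described above can be sketched roughly as follows. This is a toy illustration in Python rather than OCaml, and it is not Philip`'s tokeniser: the `State`, `step`, and `run` names and the two-mode grammar (text versus tag) are assumptions made up for the example. Because `step` never modifies its input, the same intermediate state can be "forked" and fed different continuations.

```python
from typing import NamedTuple, Tuple


class State(NamedTuple):
    """Immutable tokeniser state (hypothetical, two modes only)."""
    mode: str = "data"                        # "data" or "tag"
    buffer: str = ""                          # characters of the token in progress
    tokens: Tuple[Tuple[str, str], ...] = ()  # tokens emitted so far


def step(state: State, ch: str) -> State:
    """Consume one character and return the next state; the input is untouched."""
    if state.mode == "data":
        if ch == "<":
            # Flush any pending text token, then switch to tag mode.
            emitted = (("text", state.buffer),) if state.buffer else ()
            return State("tag", "", state.tokens + emitted)
        return state._replace(buffer=state.buffer + ch)
    # mode == "tag"
    if ch == ">":
        return State("data", "", state.tokens + (("tag", state.buffer),))
    return state._replace(buffer=state.buffer + ch)


def run(state: State, text: str) -> State:
    """Feed a whole string through step(), one character at a time."""
    for ch in text:
        state = step(state, ch)
    return state


# Fork the stream: one shared prefix, two different continuations.
prefix = run(State(), "a<b")
left = run(prefix, ">c<")
right = run(prefix, "r>")
```

Since `prefix` is an immutable value, running `left` and `right` from it cannot interfere with each other, which is the property the side-effect-free design is after.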
  147. # [03:15] <MikeSmith> side-effect free? hey, maybe you can write a tokenizer in XSLT 1.0 - that'd be fun, given how painful even most simple string processing is in XSLT ... side-effect freeness was one of the design goals that James Clark had for XSLT
  148. # [03:15] <MikeSmith> OCaml has a good regular expressions library?
  149. # [03:16] <Philip`> I think it does have a regular expression library but I haven't used it at all
  150. # [03:17] * Joins: olivier (ot@128.30.52.30)
  151. # [03:18] <MikeSmith> I have some professional acquaintances who are doing a lot of work using Erlang
  152. # [03:19] <Philip`> Mainly I'm using OCaml because it's interesting to try, and I'll probably have to end up writing stuff in it for the next three years or so, so I might as well learnt it now :-)
  153. # [03:19] <Philip`> *learn
  154. # [03:21] <MikeSmith> makes sense
  155. # [03:22] * MikeSmith wonders what draft Rob Burns is talking about when he says, in reply to jgraham, "I wonder whether anyone reads the draft"
  156. # [03:23] <MikeSmith> Rob Burns wrote a draft of something? if he did, I gotta admit I have not read it
  157. # [03:24] <Philip`> I assumed he meant the HTML5 draft
  158. # [03:24] <Philip`> I think jgraham probably has read that, though
  159. # [03:46] * Joins: Zeros (Zeros-Elip@67.154.87.254)
  160. # [04:08] <MikeSmith> hsivonen - I'm wondering if it would be too early to try to gather some people together for specific discussion about building conformance checkers
  161. # [04:10] <MikeSmith> It's worrying to think about how much work it will be to build conformance checkers other than the one you have built, in other languages
  162. # [04:10] <MikeSmith> and how we can try to ensure that they report the same results
  163. # [04:11] <MikeSmith> because we have some grief waiting if we create a situation where multiple HTML5 conformance checkers are in common use, but reporting different results for a check of the same document
  164. # [04:15] <MikeSmith> almost suggests that what we might end up needing is a spec that describes conformant behavior for conformance checkers ...
  165. # [04:27] <Zeros> MikeSmith, that was kind of Hixie's intent, that there is no single "correct" checker, just the spec.
  166. # [04:27] <Zeros> And then everyone gets to implement their own if they want one
  167. # [04:27] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
  168. # [04:28] <MikeSmith> Not that it matters, but I don't think Hixie's intent was that we end up with multiple conformance checkers reporting conflicting results for the same page
  169. # [04:29] <MikeSmith> I'm obviously not suggesting we have one "correct" checker
  170. # [04:29] <Zeros> As I understand it he wanted people to work together to improve each of the validators
  171. # [04:30] <MikeSmith> jesus
  172. # [04:30] <Zeros> If reporting the same results is all that matters then we might as well have a single validator.
  173. # [04:30] <MikeSmith> no, obviously not
  174. # [04:31] <Zeros> it ensures the validation results are always the same
  175. # [04:31] <Zeros> which is precisely what you want :)
  176. # [04:31] <MikeSmith> Zeros, no it's not, genius
  177. # [04:31] <Zeros> Wow, you get nasty fast.
  178. # [04:31] <MikeSmith> I do when you try to put words into my mouth
  179. # [04:32] <Zeros> you said "ensure that they report the same results"
  180. # [04:32] <Zeros> that was your words
  181. # [04:32] * Joins: gavin (gavin@74.103.208.221)
  182. # [04:34] <Zeros> heh, nice talking to you though
  183. # [04:35] * Quits: Zeros (Zeros-Elip@67.154.87.254) (Quit: Leaving)
  184. # [04:58] * Quits: heycam (cam@130.194.72.84) (Quit: bye)
  185. # [05:01] * Quits: Lionheart (robin@66.57.69.65) (Connection reset by peer)
  186. # [05:29] * Quits: dbaron (dbaron@63.245.220.242) (Quit: 8403864 bytes have been tenured, next gc will be global.)
  187. # [05:39] * Joins: Lionheart (robin@66.57.69.65)
  188. # [06:34] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
  189. # [06:39] * Joins: gavin (gavin@74.103.208.221)
  190. # [07:24] * Quits: olivier (ot@128.30.52.30) (Quit: This computer has gone to sleep)
  191. # [07:50] * Joins: olivier (ot@128.30.52.30)
  192. # [07:55] * Joins: heycam (cam@203.214.115.243)
  193. # [08:04] * Quits: MikeSmith (MikeSmith@mcclure.w3.org) (Client exited)
  194. # [08:14] * Joins: MikeSmith (MikeSmith@mcclure.w3.org)
  195. # [08:36] * Joins: NiColasS (nicolas@213.7.50.164)
  196. # [08:36] <NiColasS> I am using html from now on !
  197. # [08:40] <NiColasS> which doctype is more recommended ?
  198. # [08:41] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
  199. # [08:44] * Quits: NiColasS (nicolas@213.7.50.164) (Quit: NiColasS)
  200. # [08:46] * Joins: gavin (gavin@74.103.208.221)
  201. # [08:54] <hsivonen> jgraham: re test case format: "all the attributes must be given, in alphabetical order". Can we change that to lexicographically sorted by UTF-16 code unit?
  202. # [08:55] <hsivonen> I expect it already means that
  203. # [09:09] <hsivonen> MikeSmith: I'm really bad at estimating how much time a given piece of a conformance checker takes
  204. # [09:10] <hsivonen> MikeSmith: but sure, I can discuss stuff
  205. # [09:11] <MikeSmith> hsivonen - cool
  206. # [09:15] <MikeSmith> hsivonen - would it be accurate to say that your checker has a mechanism for making it possible to expose parse events for non well-formed XML as a stream of SAX events that XML tools can handle?
  207. # [09:15] <MikeSmith> (I don't have your thesis in front of me now, so going from memory...)
  208. # [09:16] <hsivonen> It has a mechanism for exposing HTML as SAX events and a mechanism for exposing XML as SAX events
  209. # [09:16] <hsivonen> I make no effort to fix bad stuff labeled as XML
  210. # [09:16] <MikeSmith> hsivonen - OK
  211. # [09:17] <hsivonen> and the HTML mechanism is being replaced
  212. # [09:17] <MikeSmith> anyway, I think that technique is particularly valuable
  213. # [09:17] <MikeSmith> replaced?
  214. # [09:17] <MikeSmith> you mean you will be rewriting the parsing algorithm?
  215. # [09:17] <MikeSmith> to match the spec?
  216. # [09:18] <hsivonen> yes.
  217. # [09:19] <hsivonen> Actually, instead of "will" it is "almost have"
  218. # [09:21] * Joins: tH (Rob@87.102.67.108)
  219. # [09:22] <MikeSmith> hsivonen - OK
  220. # [09:24] <MikeSmith> so I would hope we could take that technique for making it possible to handle non-WF HTML and use it in other languages as well
  221. # [09:25] <MikeSmith> for one thing, it would free validators from dependence on nsgmls
  222. # [09:26] <MikeSmith> or any dependence on SGML tools at all
  223. # [09:27] <hsivonen> what I'm writing now handles non-WF HTML
  224. # [09:28] <MikeSmith> hsivonen - understood. what I meant to say was that others could write other implementations using the same technique you developed
  225. # [09:29] <hsivonen> (I didn't develop the technique. I learned it from John Cowan.)
  226. # [09:30] <MikeSmith> ah
  227. # [09:31] <MikeSmith> I gotta admit I didn't pay much attention to TagSoup before
  228. # [09:31] <hsivonen> MikeSmith: my understanding is that Henry S. Thompson is working on a generic but vocabulary-specifically configurable schema-guided soup-XML parser using ideas from TagSoup
  229. # [09:32] <MikeSmith> but I remember your thesis saying that the TagSoup approach wasn't really suitable for a validation/conformance checker because its principal aim is just to "fix" (or whatever term) the source
  230. # [09:32] <MikeSmith> not to report errors in the source
  231. # [09:33] <hsivonen> MikeSmith: whereas Anne is working on one that is generic and not per-vocabulary configurable
  232. # [09:33] <MikeSmith> OK
  233. # [09:33] <hsivonen> MikeSmith: yeah, TagSoup is for apps that don't care about errors
  234. # [09:33] <hsivonen> MikeSmith: Petr Nálevka added error reporting to TagSoup, though
  235. # [09:34] <hsivonen> TagSoup doesn't conform to HTML 5, of course
  236. # [09:34] <hsivonen> it does its own thing
  237. # [09:35] * Quits: karl (karlcow@128.30.52.30) (Quit: Where dwelt Ymir, or wherein did he find sustenance?)
  238. # [09:36] * Quits: olivier (ot@128.30.52.30) (Quit: Leaving)
  239. # [09:44] * Joins: zcorpan (zcorpan@84.216.42.141)
  240. # [10:07] <jgraham> hsivonen: What's the difference between "alphabetic" and "lexicographically sorted"?
  241. # [10:17] <hsivonen> jgraham: case, for one
  242. # [10:17] <hsivonen> jgraham: also, it is unambiguous for non-a-to-z characters
  243. # [10:18] <jgraham> Does lexicographically sorted just mean sorted by code point index?
  244. # [10:18] <hsivonen> jgraham: yes.
  245. # [10:18] <jgraham> OK. That sounds sensible.
  246. # [10:18] <hsivonen> jgraham: except since we are both using UTF-16, we probably want to sort by code unit instead of code point
  247. # [10:18] <hsivonen> (matters for astral stuff)
  248. # [10:18] <jgraham> OK
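
The difference between the orderings discussed above can be shown with a small Python sketch. The `utf16_key` helper and the sample attribute names are assumptions chosen for illustration; they are not taken from the html5lib test suite. Encoding to big-endian UTF-16 and comparing the bytes gives the same order as comparing 16-bit code units, which differs from code point order only when astral characters are involved.

```python
def utf16_key(name: str) -> bytes:
    """Sort key: big-endian UTF-16 bytes compare in the same order as code units."""
    return name.encode("utf-16-be")


# U+FF5E is in the BMP; U+1D306 is astral and becomes the surrogate pair D834 DF06.
names = ["\uff5e", "\U0001d306", "a", "B"]

by_code_point = sorted(names)                # Python compares str by code point
by_code_unit = sorted(names, key=utf16_key)  # what a UTF-16 implementation sees

print([hex(ord(n)) for n in by_code_point])
print([hex(ord(n)) for n in by_code_unit])
```

In both orderings "B" sorts before "a" (so this is not case-insensitive alphabetical order), but the astral character moves ahead of U+FF5E under code unit order because its lead surrogate 0xD834 is smaller than 0xFF5E.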
  249. # [10:20] * Quits: spleen_blender (notgonnage@72.16.243.238) (Connection reset by peer)
  250. # [10:21] <hsivonen> edit the wiki
  251. # [10:22] * Quits: sbuluf (xtyh@200.49.140.181) (Ping timeout)
  252. # [10:25] <hsivonen> edited
  253. # [10:25] <hsivonen> that is, I edited
  254. # [10:26] <hsivonen> didn't mean to suggest that you edit it
  255. # [10:26] <hsivonen> anyway...
  256. # [10:27] <jgraham> Ah, I was just about to do it :)
  257. # [10:28] <jgraham> I actually think the description of the format there isn't very close to what we have implemented
  258. # [10:30] <hsivonen> jgraham: any major changes that I should be aware of?
  259. # [10:30] <jgraham> For example, you don't need the #errors section to follow the #data section
  260. # [10:30] <jgraham> Basically the implementation we have assumes:
  261. # [10:30] <jgraham> #data starts a new test
  262. # [10:31] <jgraham> There is a known list of subsections of test data which all start #something
  263. # [10:32] <hsivonen> eek. that's more complicated than absolutely necessary :-(
  264. # [10:32] <jgraham> hsivonen: I can make changes if you want
  265. # [10:33] <hsivonen> jgraham: I'd prefer the order of the subsections to be predictable. even better if all subsections were always there
  266. # [10:33] <jgraham> but a goal is to have the format slightly extensible so we can add extra (optional) sections to the tests like #innerHTML for the fragment case
  267. # [10:34] <hsivonen> but most of all, I'd prefer the sections to be considered to end with LF# instead of LF#foo
  268. # [10:35] <jgraham> Well I guess there's no problem saying "Any line that starts '#' is a new subsection"
  269. # [10:36] <hsivonen> good
  270. # [10:36] <jgraham> It just means you can't have test data with that string in
  271. # [10:36] <hsivonen> not much of a loss given that # isn't that interesting in test data
  272. # [10:37] <hsivonen> at the start of a line
  273. # [10:37] <jgraham> Indeed
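
The rule agreed on above ("#data starts a new test" and "any line that starts '#' is a new subsection") can be sketched as a small reader. This is only an illustration of that rule under stated assumptions, not the html5lib implementation: the `parse_tests` name, the dict-per-test return shape, and the sample text are all made up for the example.

```python
def parse_tests(text):
    """Split a test file into tests, each a dict of subsection name -> content."""
    tests = []      # one dict per test: section name -> list of lines
    current = None  # line list of the subsection currently being filled
    for line in text.splitlines():
        if line.startswith("#"):
            name = line[1:]
            if name == "data":
                tests.append({})  # "#data" always opens a new test
            if not tests:
                raise ValueError("subsection before the first #data")
            current = tests[-1].setdefault(name, [])
        elif current is not None:
            current.append(line)
    return [{name: "\n".join(lines) for name, lines in t.items()} for t in tests]


# Hypothetical sample: the second test carries an optional extra section.
SAMPLE = (
    "#data\n<p>x\n#errors\n#document\n| <html>\n"
    "#data\nfoo\n#innerHTML\ndiv\n#document\n| \"foo\"\n"
)
parsed = parse_tests(SAMPLE)
```

Because the reader does not need a list of known section names, optional sections such as `#innerHTML` can be added without changing it. The cost, as noted in the discussion, is that test data cannot contain a line beginning with `#`.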
  274. # [10:39] <hsivonen> Java API designers have a lot to learn from Python
  275. # [10:39] <hsivonen> 4 lines to instantiate an XML parser
  276. # [10:40] <hsivonen> another 4 to instantiate an XML serializer
  277. # [10:40] <hsivonen> going with the default
  278. # [10:48] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
  279. # [10:53] * Joins: gavin (gavin@74.103.208.221)
  280. # [11:01] * Joins: ROBOd (robod@86.34.246.154)
  281. # [11:06] * Quits: mjs (mjs@17.255.104.239) (Quit: mjs)
  282. # [11:27] <hsivonen> I now have something that runs and dumps a tree in the html5lib format
  283. # [11:27] <hsivonen> still a lot of known brokenness to fix
  284. # [11:30] * Quits: zcorpan (zcorpan@84.216.42.141) (Ping timeout)
  285. # [11:49] * Joins: zcorpan (zcorpan@84.216.42.141)
  286. # [11:57] <gsnedders> only 12 messages overnight on public-html… the traffic is really going down
  287. # [11:59] <MikeSmith> gsnedders - please don't talk about it. If you mention that, it'll jinx it and next thing you know we'll have a new thread about the indenting style of the source for the spec
  288. # [12:00] <gsnedders> MikeSmith: or worse — versioning.
  289. # [12:00] <MikeSmith> heh :9
  290. # [12:00] <MikeSmith> we need a FAQ really
  291. # [12:00] <MikeSmith> a preemptive FAQ
  292. # [12:01] <MikeSmith> "Please no versioning discussion ... please if you think that we should force authors to write only in valid XML, find another place to discuss it ... " etc.
  293. # [12:02] <gsnedders> if you want x, y, z see public-xhtml2
  294. # [12:03] <gsnedders> it's amazing how many people on public-html have a view that goes against what is in-scope for this WG, and is in-scope for XHTML2
  295. # [12:03] <MikeSmith> Or to put it another way, "These topics have been discussed at great length already, and it's not likely that whatever you might have to say about it is going to change the world and magically bring everything to a resolution"
  296. # [12:04] <MikeSmith> gsnedders - yeah
  297. # [12:04] <gsnedders> Or, to put it in another way again, "STFU."
  298. # [12:04] <MikeSmith> heh
  299. # [12:05] <MikeSmith> anyway, amazing to me that people don't see that you can author in whatever language you want and transform your content to HTML
  300. # [12:05] <MikeSmith> can create your own perfect authoring language that exactly meets whatever criteria you have
  301. # [12:06] <MikeSmith> and then try to convince others to use that for authoring if you want
  302. # [12:06] <MikeSmith> but it does not need to be directly supported in browsers
  303. # [12:06] <zcorpan> phew! document.title was insane
  304. # [12:07] <zcorpan> (and fun)
  305. # [12:08] * MikeSmith reads zcorpan message on document.title
  306. # [12:12] <gsnedders> zcorpan: what do you expect? it's HTML!
  307. # [12:13] <zcorpan> gsnedders: oh sure :)
  308. # [12:13] <MikeSmith> zcorpan - so current spec doesn't match behavior of any current browser?
  309. # [12:13] <zcorpan> MikeSmith: right
  310. # [12:14] <zcorpan> my proposal doesn't either, but is closer
  311. # [12:15] <MikeSmith> yeah
  312. # [12:15] <MikeSmith> it's great to be getting these detailed spec reviews posted to public-html
  313. # [12:16] <MikeSmith> I think a side effect of it'l be to try to set higher expectations about what't appropriate for the list
  314. # [12:18] <zcorpan> yeah
  315. # [12:19] <zcorpan> and encourage others to do detailed reviews
  316. # [12:20] <MikeSmith> I think at some point we need to take a hard look at what has actually been accomplished by having a W3C working group participating in work on the HTML5 spec that might not have been accomplished by having the discussion take place only on the WHATWG list
  317. # [12:21] <MikeSmith> but not sure what we can point to so far as far as that goes
  318. # [12:21] <zcorpan> headers="" research
  319. # [12:22] <MikeSmith> OK, true that
  320. # [12:23] <MikeSmith> I think another anticipated benefit of having discussion within a W3C context was that it would facilitate and encourage participation from Microsoft ...
  321. # [12:23] <zcorpan> yeah
  322. # [12:25] <gsnedders> currently at 1212 tests for #numbers
  323. # [12:25] <gsnedders> still parts of the algorithm not tested, though
  324. # [12:25] <gsnedders> *algorithms
  325. # [12:26] <zcorpan> man, if we keep up this rate, we will have 20,000 tests at 2010 for sure
  326. # [12:27] <zcorpan> that doesn't mean we will have complete implementations though
  327. # [12:27] <gsnedders> zcorpan: what I'm doing for the numbers is a massive advantage for the numbers though: use each test input data for each number algorithm
  328. # [12:27] <gsnedders> (which massively increases the amount of invalid input, and checking of the error handling)
  329. # [12:30] <zcorpan> ok, document.body is next
  330. # [12:31] * gsnedders likes the fact that Bungie still cares about those who don't use Windows (insofar as they make sure their site works on Safari, they encode many videos in MPEG standards as well as WMV, etc.)
  331. # [12:34] <hsivonen> zcorpan: good stuff on the list
  332. # [12:39] <zcorpan> hsivonen: thanks
  333. # [12:46] * Joins: mjs (mjs@64.81.48.145)
  334. # [12:47] * Quits: MikeSmith (MikeSmith@mcclure.w3.org) (Quit: Less talk, more pimp walk.)
  335. # [12:47] * Joins: Sander (svl@80.60.87.115)
  336. # [12:47] * Quits: zcorpan (zcorpan@84.216.42.141) (Ping timeout)
  337. # [12:48] * Joins: zcorpan (zcorpan@84.216.42.141)
  338. # [12:51] * Joins: alexf (alejandro@85.152.42.1)
  339. # [12:51] * Quits: zcorpan (zcorpan@84.216.42.141) (Ping timeout)
  340. # [13:13] * Joins: myakura (myakura@58.88.37.26)
  341. # [13:18] <gsnedders> 20,000 tests probably won't be enough, actually :P
  342. # [13:19] * gsnedders moves on to lists of integers
  343. # [13:24] <Philip`> Are you manually verifying the output for each of these 1212 tests? :-)
  344. # [13:24] <gsnedders> no
  345. # [13:25] <gsnedders> each was written by hand, and not relying on any implementation, though
  346. # [13:25] <gsnedders> it's over 1500 now anyway :)
  347. # [13:32] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
  348. # [13:37] * Joins: gavin (gavin@74.103.208.221)
  349. # [13:41] * Philip` wonders why http://www.whatwg.org/specs/web-apps/current-work/multipage/section-entities.html sorts semicolons before end-of-strings
  350. # [13:42] <Philip`> (since that doesn't seem like a natural sorting order, and it's not an order that helps with the way I'm trying to implement it)
  351. # [13:53] * Joins: zcorpan (zcorpan@84.216.42.141)
  352. # [13:53] * Joins: karl (karlcow@128.30.52.30)
  353. # [13:57] * Joins: olivier (ot@128.30.52.30)
  354. # [14:05] * Quits: zcorpan (zcorpan@84.216.42.141) (Ping timeout)
  355. # [14:11] * Philip` gets down to two-and-a-half test failures
  356. # [14:21] <Philip`> Aha, only half a test failure now
  357. # [14:33] <Philip`> Oh, but I have an untested bug :-(
  358. # [14:34] <Philip`> but it looks like I'm not the only one
  359. # [14:35] * Quits: myakura (myakura@58.88.37.26) (Quit: Leaving...)
  360. # [14:36] * Quits: Sander (svl@80.60.87.115) (Quit: And back he spurred like a madman, shrieking a curse to the sky.)
  361. # [14:37] <Philip`> hsivonen: I believe <h a='&noti'> should return an attribute with value "&noti", but you give "\u????i" (for some value of ? that I don't know off the top of my head)
  362. # [14:39] <jmb> U+00AC, I'd expect
  363. # [14:42] <Philip`> Ah, that number sounds familiar
  364. # [14:42] * Philip` fixes that bug in his own code
  365. # [14:48] * Joins: zcorpan (zcorpan@84.216.42.141)
  366. # [15:02] * Quits: karl (karlcow@128.30.52.30) (Quit: This computer has gone to sleep)
  367. # [15:18] * Quits: olivier (ot@128.30.52.30) (Quit: Leaving)
  368. # [15:40] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
  369. # [15:45] * Quits: Lionheart (robin@66.57.69.65) (Connection reset by peer)
  370. # [15:45] * Joins: gavin (gavin@74.103.208.221)
  371. # [16:33] * Joins: billmason (billmason@69.30.57.156)
  372. # [16:45] <hsivonen> Philip`: does what you said about &noti still apply after "* Philip` fixes that bug in his own code"?
  373. # [16:49] <Philip`> hsivonen: Yes - I was handling that case incorrectly, and your code was doing it incorrectly too, and then I fixed my code, but yours is still incorrect
  374. # [16:50] <hsivonen> Philip`: ok. thanks. Is there a tokenizer-level test case about this in the html5lib repo?
  375. # [16:51] <Philip`> I added one to http://html5lib.googlecode.com/svn/trunk/testdata/tokenizer/test1.test ("Entity in attribute without semicolon ending in i")
  376. # [16:53] <hsivonen> Philip`: ok. thanks. I'll take a look tomorrow
  377. # [16:53] <Philip`> (Probably not the best description, since it's more relevant that it's almost but not quite 'notin', but I'm no good at describing these things concisely)
  378. # [16:54] <hsivonen> Philip`: btw, do you have a smarter implementation approach for it than what I have?
  379. # [16:56] <gsnedders> hmmm… "10" as a list of integers returns [1]
  380. # [16:58] <Philip`> I think my code is about the same as what you did (which is not coincidental) - it has a sorted array of entity names, then finds the range of names which match the first character (using a binary search, with STL doing all the hard work), then finds the subrange that match the second character, then repeats until the range has size zero/one
  381. # [16:58] <Philip`> (remembering any complete matches which it finds along the way)
  382. # [16:59] * Joins: spleen_blender (notgonnage@72.16.243.238)
  383. # [17:00] <Philip`> That seems to be generally sensible, since it never reads more characters than are required, and it doesn't waste loads of memory (e.g. on a trie)
  384. # [17:04] <Philip`> (http://canvex.lazyilluminati.com/svn/tokeniser/tokeniser.cpp at around where it says "entityNames")
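
The lookup Philip` describes at [16:58] can be sketched roughly as below. This is a minimal Python illustration, not his C++ code and not hsivonen's implementation. The tiny `ENTITY_NAMES` table and the function name `longest_entity_match` are assumptions made for the example. It narrows a range of a sorted name list one character at a time with binary search, and remembers the longest complete match seen along the way.

```python
from bisect import bisect_left, bisect_right

# Hypothetical, tiny entity table for illustration; the real list is far longer.
ENTITY_NAMES = sorted(["amp", "amp;", "lt", "lt;", "not", "not;", "notin;"])

def longest_entity_match(text, names=ENTITY_NAMES):
    """Return the longest entity name that is a prefix of text, or None."""
    lo, hi = 0, len(names)  # current candidate range [lo, hi)
    best = None
    for i in range(len(text)):
        prefix = text[:i + 1]
        # Keep only the names that start with the prefix read so far.
        # Names are ASCII, so prefix + U+10FFFF sorts after all of them.
        lo = bisect_left(names, prefix, lo, hi)
        hi = bisect_right(names, prefix + "\U0010FFFF", lo, hi)
        if lo == hi:
            break  # no name can match any more; stop reading input
        if names[lo] == prefix:
            best = prefix  # a complete match; a longer one may still follow
    return best
```

With this table, `longest_entity_match("noti")` gives `"not"`. Whether that match is then expanded inside an attribute value depends on the character after the match, which is the check discussed later in the log and is not modelled here.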
  385. # [17:05] <gsnedders> Actually implementing the algorithms and having > 1500 test cases is finally paying off
  386. # [17:18] <gsnedders> these algorithms have amazingly few bugs
  387. # [17:25] * Philip` creates 1011 tokeniser test cases
  388. # [17:25] <gsnedders> DanC: feeling brave saying you'll publish all three documents against objections? I don't want to hear the mailing list when you do.
  389. # [17:25] <Philip`> ...and I've found one bug, though I don't know which implementation is the buggy one
  390. # [17:25] <DanC> it's my job
  391. # [17:26] <gsnedders> DanC: heh. so many people will complain. probably end up people acting my age.
  392. # [17:27] <Philip`> Hmm, html5lib agrees with me
  393. # [17:27] <Philip`> Input: "<z/0 <"
  394. # [17:27] <Philip`> Question: How many parse errors?
  395. # [17:27] <gsnedders> is that an opening or closing tag?
  396. # [17:28] <gsnedders> the former, I assume?
  397. # [17:29] <Philip`> Oops, the difference is not just parse errors
  398. # [17:29] <Philip`> I get ["ParseError", "ParseError", ["StartTag", "z", {"0": "", "<": ""}]]
  399. # [17:30] <Philip`> hsivonen's says: ["ParseError","ParseError",["StartTag","z",{"0":""}],"ParseError",["Character","<"]]
  400. # [17:30] * Philip` tries to work out what's happening
  401. # [17:32] <Philip`> hsivonen: afterAttributeNameState does a "case '<':" but the spec doesn't say anything about handling < in that state
  402. # [17:38] <zcorpan> did before i think
  403. # [17:41] * Philip` makes another 2115 test cases
  404. # [17:41] <Philip`> (Am I winning yet?)
  405. # [17:42] <zcorpan> Philip`: are you shitting test cases? :)
  406. # [17:42] <Philip`> Bah, I didn't find any new bugs that time
  407. # [17:44] <hsivonen> zcorpan: my girlfriend tests water purifiers. she uses that kind of test material. ;-)
  408. # [17:44] <hsivonen> Philip`: < noted. will have a look tomorrow
  409. # [17:45] <zcorpan> hsivonen: :)
  410. # [17:47] * Philip` tries another 8145
  411. # [17:48] <Philip`> Whoops, I think zombie processes killed my test generator
  412. # [17:48] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
  413. # [17:48] <spleen_blender> lol, Z
  414. # [17:53] * Joins: gavin (gavin@74.103.208.221)
  415. # [17:56] <Philip`> I just find that <-in-attribute-name case lots of times, and no other visible bugs
  416. # [17:56] <Philip`> (Er, that should be "<-in-attribute" since it matters around attribute values too)
  417. # [17:58] <Philip`> It would be easier to test html5lib if its tokeniser output format hadn't totally changed and made every test fail
  418. # [18:05] <Philip`> ...and if svn didn't completely freeze solid whenever I tried accessing the html5lib repository
  419. # [18:10] * Joins: zcorpan_ (zcorpan@84.216.42.141)
  420. # [18:10] <Philip`> Ah, good, it's not my fault, it's just Google that's broken
  421. # [18:10] * Quits: zcorpan (zcorpan@84.216.42.141) (Ping timeout)
  422. # [18:54] * Quits: ROBOd (robod@86.34.246.154) (Quit: http://www.robodesign.ro )
  423. # [18:59] * Joins: ROBOd (robod@86.34.246.154)
  424. # [19:01] * Joins: dbaron (dbaron@63.245.220.242)
  425. # [19:07] * Parts: alexf (alejandro@85.152.42.1)
  426. # [19:12] * Joins: Lionheart (robin@198.86.248.1)
  427. # [19:16] * Joins: edas (edaspet@88.191.34.123)
  428. # [19:27] <gsnedders> hsivonen: can you try running "-a" through the list of integers algorithm?
  429. # [19:31] <gsnedders> hsivonen: actually, that's wrong. "-" is what I'm interested in.
  430. # [19:54] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
  431. # [19:59] * Joins: gavin (gavin@74.103.208.221)
  432. # [20:06] <zcorpan_> DanC: i might be here on irc during the telecon tomorrow, but in any case: i'm willing to help with test suite organization
  433. # [20:12] <DanC> ah. interesting.
  434. # [20:14] * Quits: xover (xover@193.157.66.5) (Ping timeout)
  435. # [20:15] <DanC> any ideas on how to organize tests, zcorpan_ ?
  436. # [20:16] <DanC> I'm interested in tests materials that (a) aid developers in building good software, and (b) aid users in judging software and reporting problems
  437. # [20:17] <DanC> stuff that captures issues that people care about with objective results
  438. # [20:18] <DanC> the GRDDL spec has a much smaller scope, but we came up with a few dozen tests and we have test results in machine-readable form from a handful of implementations. http://www.w3.org/2001/sw/grddl-wg/td/test_results
  439. # [20:19] <DanC> I expect we'll need several different kinds of tests for HTML
  440. # [20:19] <DanC> we'll be able to automate some parts more than others, I expect
  441. # [20:25] * Joins: Sander (svl@80.60.87.115)
  442. # [20:27] * Quits: edas (edaspet@88.191.34.123) (Ping timeout)
  443. # [20:31] <gsnedders> too… much… spec…
  444. # [20:32] <zcorpan_> DanC: haven't thought much about it yet
  445. # [20:33] <zcorpan_> DanC: although we will have thousands of tests
  446. # [20:33] <zcorpan_> (or already have)
  447. # [20:34] <DanC> too much spec for what?
  448. # [20:34] <gsnedders> DanC: to review
  449. # [20:35] <zcorpan_> gsnedders: concentrate on one thing at a time :)
  450. # [20:35] <gsnedders> zcorpan_: I am
  451. # [20:35] <gsnedders> zcorpan_: It still seems endless, though
  452. # [20:36] <gsnedders> some day I'll misspell "microsyntaxes" in the subject line of one of my emails…
  453. # [20:36] * Joins: xover (xover@193.157.66.5)
  454. # [20:36] <DanC> I find it hard to believe that a BNF or regex isn't easier to specify and review than english prose for stuff like microsyntaxes. oh well.
  455. # [20:38] <gsnedders> DanC: a lot of the requirements would end up being English prose anyway
  456. # [20:38] <DanC> for example?
  457. # [20:38] <zcorpan_> DanC: perhaps we could just check in all tests at http://code.google.com/p/html5/
  458. # [20:38] <gsnedders> DanC: all the various times when you exist the ratios algorithm. you'd end up with so many alternatives in BNF or regex
  459. # [20:38] <gsnedders> *exit
  460. # [20:39] <DanC> what's a few more alternatives? this is a rather mature part of computer science.
  461. # [20:40] <gsnedders> DanC: take a look at #ratios, it isn't overly long, but I can't think of many easy ways of expressing that
  462. # [20:42] <DanC> I don't see anything that won't fit in a regex
  463. # [20:42] <gsnedders> It won't be an overly simple one though
  464. # [20:42] <DanC> so?
  465. # [20:42] <gsnedders> I'd rather have prose than complex regex
  466. # [20:42] <DanC> oh well.
  467. # [20:43] <DanC> you're doing the work, not me.
  468. # [20:44] <DanC> zcorpan_, hosting at code.google.com might work, as long as we can keep a copy in w3.org somewhere too. does code.google.com offer an rsync interface?
  469. # [20:44] <DanC> I'd rather use a decentralized version control system like hg or bzr or git
  470. # [20:44] <gsnedders> DanC: and anne said that I was the person to ask if he ever needed an overly complex regex :P
  471. # [20:48] <gsnedders> DanC: ^[^.0123456789]*([0123456789]+\.[0123456789]*|[0123456789]*\.[0123456789]+|[0123456789])(<unicode character class Zs>)*((%|٪|﹪|%|‰|‱)[^0123456789]*|[^.0123456789]*([0123456789]+\.[0123456789]*|[0123456789]*\.[0123456789]+|[0123456789])[^0123456789%٪﹪%‰‱]*)$
  472. # [20:48] <gsnedders> DanC: I think that expresses the algorithm…
  473. # [20:49] <DanC> written out as BNF, it's probably quite straightforward
  474. # [20:49] <DanC> since, for example, [0123456789] gets factored out as <digit>
  475. # [20:50] <gsnedders> *DIGIT "." +DIGIT / +DIGIT "." *DIGIT / DIGIT covers a floating point number in ABNF, I think
  476. # [20:50] <gsnedders> It'll be simpler than URI's ABNF for certain though
  477. # [20:51] <DanC> when you said "a lot of the requirements would end up being English prose anyway" I thought you were saying that there are constraints that can't be expressed in BNF. I don't see any so far.
  478. # [20:52] <gsnedders> ratios probably isn't the best of examples
  479. # [20:53] <DanC> I think it's worth publishing BNF for these things, even if it has to be a separate document. I've got a handful of volunteers for the formalization task.
  480. # [20:53] <gsnedders> dates would end up being verbose if you wanted to be exact (RFC3339's ABNF allows hours > 24, minutes > 59, seconds > 60)
  481. # [20:54] <gsnedders> I've got nothing against publishing some sort of BNF, but I'd rather the prose were the only normative part of the standard
  482. # [20:54] <DanC> true, capturing leap year rules in regex's isn't worthwhile. \d\d\d\d-\d\d-\d\d plus some prose constraints is a happy medium.
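
The "regex plus prose constraints" split DanC suggests can be sketched like this in Python. It is only an illustration: the name `is_valid_date` is made up here, and Python's `datetime.date` (proleptic Gregorian calendar, years 1 to 9999) stands in for whatever prose constraints a spec would actually state.

```python
import re
from datetime import date

# The shape check from the log: \d\d\d\d-\d\d-\d\d (ASCII digits only here).
DATE_SHAPE = re.compile(r"\d\d\d\d-\d\d-\d\d", re.ASCII)

def is_valid_date(text):
    """Check the shape with a regex, then the calendar rules separately."""
    if DATE_SHAPE.fullmatch(text) is None:
        return False
    year, month, day = (int(part) for part in text.split("-"))
    try:
        # Assumption: datetime.date plays the role of the prose constraints
        # (month range, day range, leap years).
        date(year, month, day)
    except ValueError:
        return False
    return True
```

The regex stays trivial to review, and the leap year logic lives in one place instead of being encoded as alternations.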
  483. # [20:55] <gsnedders> I liked BNF more before I started dealing with URIs and IRIs.
  484. # [20:55] <DanC> 3.2.3.4. Ratios doesn't motivate any of the complexity.
  485. # [20:56] <DanC> the regex at the end of the URI spec works much better than the BNF. URIs aren't parsed top-down like programming languages; they're chopped up piece by piece
  486. # [20:57] <gsnedders> they aren't that nice when you do try and parse them without using regex
  487. # [20:58] <gsnedders> </complete:understatement>
  488. # [20:58] <DanC> URI syntax is particularly horrible, and it took a long time to figure out the bounds of the standard. (it's still ongoing).
  489. # [20:58] <zcorpan_> DanC: don't know
  490. # [21:00] <DanC> I wonder where this %|٪|﹪|%|‰|‱ stuff came from. Surely no cows blazed any path like that.
  491. # [21:00] <gsnedders> DanC: it's allowed as content of elements
  492. # [21:01] <DanC> yes, but why bother with ‱ ? is that really worthwhile? why go beyond one % character?
  493. # [21:01] <gsnedders> hmm… URIs predate me :\ (though I think any standards regarding URL/URIs are younger)
  494. # [21:01] <gsnedders> DanC: the arabic one is really used by arabic people.
  495. # [21:03] <DanC> wild. I think it's best to return to my "I don't care what design you come up with, as long as there are plenty of tests and the implementors are willing to pass them" mode.
  496. # [21:03] * DanC needs lunch
  497. # [21:03] * DanC is late for a telcon :-/
  498. # [21:27] * Quits: Lionheart (robin@198.86.248.1) (Ping timeout)
  499. # [21:41] <Philip`> Alas, I don't find any interesting bugs in the Python html5lib with my ~8K tokeniser tests :-(
  500. # [21:43] * Joins: briansuda (briansuda@85.220.95.76)
  501. # [21:47] * gsnedders adds with great glee to his commit message: "This is enough to test every algorithm within #numbers in the revision we're testing."
  502. # [21:47] <gsnedders> 1890 tests (inc. 315 ignored — the percentages and dimensions section which in the spec is TBW)
  503. # [21:49] <Philip`> Oops, they are actually interesting bugs
  504. # [21:50] * Philip` looks at them
  505. # [21:52] <Philip`> "<!doctype html \u000D"
  506. # [21:53] <Philip`> "<z \u000D"
  507. # [21:53] * Philip` sees a pattern
  508. # [21:53] <gsnedders> what is U+000D? CR?
  509. # [21:54] <gsnedders> if so, are you just testing the tokeniser, or the input stream as well (as they are dealt with in there)?
  510. # [21:54] <Philip`> Yep, CR
  511. # [21:55] <Philip`> This includes the input stream (since everyone seems to implement that as kind of part of the tokeniser)
  512. # [21:55] <Philip`> and that CR isn't followed by an LF, so an LF should be emitted
  513. # [21:55] <hsivonen> Philip`: zapped the < case in after attribute name. looks like I missed when Hixie zapped it from the spec.
  514. # [21:55] <hsivonen> Philip`: thanks
  515. # [21:56] <hsivonen> s/missed/missed it/
  516. # [21:56] <Philip`> and that LF is whitespace and gets skipped over, until the EOF is hit
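
The input stream behaviour Philip` describes here (a CR followed by LF is dropped, a CR not followed by LF becomes LF) can be sketched in a few lines of Python. The function name `normalize_newlines` is an assumption for the example, and this is not html5lib's actual code.

```python
def normalize_newlines(text):
    """Turn CRLF pairs and lone CRs into single LFs."""
    # Collapse CRLF first so the pair yields one LF, then convert any
    # remaining (lone) CR characters.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

For the input `"<z \r"` this yields `"<z \n"`, so the tokeniser sees whitespace and then EOF, as in the failing cases above.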
  517. # [21:57] <hsivonen> gsnedders: Do I have an implementation of list of integers somewhere?
  518. # [21:57] <hsivonen> gsnedders: If my memory serves me correctly, I used a big regexp--not the algorithm
  519. # [21:58] <hsivonen> gsnedders: if the algorithm is for stuff like area coordinates
  520. # [21:58] <gsnedders> hsivonen: I was making an assumption that you did somewhere in the conformance checker
  521. # [21:58] * hsivonen looks at the spec
  522. # [21:58] <gsnedders> but yes, things like @coords
  523. # [21:59] <hsivonen> it has been a while since I have touched those parts
  524. # [21:59] <gsnedders> hsivonen: I'm just checking I haven't gone wrong somewhere. It's the one issue I've found I'm least sure about.
  525. # [21:59] <hsivonen> gsnedders: OK. I didn't implement the algorithm. I just took a hard look at it and wrote a regexp that is supposed to accept the same strings
  526. # [22:01] * Quits: gavin (gavin@74.103.208.221) (Ping timeout)
  527. # [22:01] <hsivonen> gsnedders: http://syntax.whattf.org/relaxng/embed.rnc
  528. # [22:02] <gsnedders> hsivonen: so all you do is check for errors, therefore don't have a result
  529. # [22:02] <hsivonen> gsnedders: yeah
  530. # [22:02] <hsivonen> gsnedders: I did implement the ratio algorithm, though
  531. # [22:03] <hsivonen> Philip`: It seems I forgot to make separate entity tables for attributes
  532. # [22:03] * Quits: Sander (svl@80.60.87.115) (Quit: And back he spurred like a madman, shrieking a curse to the sky.)
  533. # [22:07] <hsivonen> doh. I no separate tables needed
  534. # [22:07] <hsivonen> s/I//
  535. # [22:07] <hsivonen> I keep forgetting what code I have written and why
  536. # [22:07] * Joins: gavin (gavin@74.103.208.221)
  537. # [22:09] * Joins: Lionheart (robin@66.57.69.65)
  538. # [22:10] <Philip`> hsivonen: "<z/0=<"
  539. # [22:10] <Philip`> or I guess "<z x=<" except I haven't tested that particular case
  540. # [22:10] <Philip`> results in the tag being closed, instead of < in the attribute value
  541. # [22:11] <Philip`> hsivonen: The problem I had with attribute entities is that I was checking the last consumed character, instead of the character after the longest entity match, so I just had to fix that to examine the correct character
  542. # [22:13] <spleen_blender> lol hsiv, story of my life
  543. # [22:14] <zcorpan_> gsnedders: the TBW markers are out of sync
  544. # [22:14] * Philip` needs to find a way to run tests in his tokeniser without starting a new process for every test case
  545. # [22:16] <hsivonen> Philip`: do you mean I still have a "<" bug to fix?
  546. # [22:16] <Philip`> hsivonen: Yes - < in attribute-value-state
  547. # [22:17] <Philip`> Uh
  548. # [22:17] <Philip`> Before-attribute-value-state?
  549. # [22:17] <Philip`> Something like that
  550. # [22:17] <Philip`> Ah, yes, it is that
  551. # [22:18] <hsivonen> Philip`: < fix checked in for before attr val
  552. # [22:18] <hsivonen> Philip`: thank you
  553. # [22:23] <Philip`> hsivonen: I can't find any more bugs now :-(
  554. # [22:24] <hsivonen> Philip`: nice
  555. # [22:25] <Philip`> Now I just have to wait until the spec changes and spawns a new set of bugs
  556. # [22:25] <hsivonen> Philip`: well, my charset sniffing is not up to date
  557. # [22:26] <hsivonen> Philip`: and I won't fix it for a while
  558. # [22:27] <zcorpan_> Philip`: i think the html5lib tests are a step before the spec. the <title><!--&amp;--></title> case
  559. # [22:28] <hsivonen> hmm. looks like I fail two encoding tests now
  560. # [22:30] <Philip`> Oh, maybe I should look at non-PCDATA at some point
  561. # [22:33] <hsivonen> oops. assertions fail when I remember to turn them on...
  562. # [22:40] <hsivonen> bah. my assertion was on the wrong line
  563. # [22:42] <jgraham> Philip`: Did you say you found a html5lib bug?
  564. # [22:42] <Philip`> jgraham: See http://html5lib.googlecode.com/svn/trunk/testdata/tokenizer/test4.test as of about two seconds ago
  565. # [22:43] <Philip`> Quite a few of those fail in html5lib for various reasons
  566. # [22:43] <Philip`> Like...
  567. # [22:43] <Philip`> unusual characters after a CR
  568. # [22:44] <Philip`> (Er, wait a minute, just trying to remember)
  569. # [22:44] <jgraham> So mostly input stream related?
  570. # [22:45] <Philip`> ...and non-BMP characters, though that's possibly just an issue with the JSON handler (since JSON is meant to do \u????\u???? surrogate pairs)
  571. # [22:46] <Philip`> ...and uppercase/lowercase tag/attribute names (though I saw you said you had a patch for that already)
  572. # [22:46] <Philip`> ...and the number of parse errors when an attribute is triplicated(?) instead of just duplicated
  573. # [22:47] <jgraham> I see 8 failures
  574. # [22:47] <Philip`> ...and attributes on end tags
  575. # [22:47] <Philip`> ...and I think that's all
  576. # [22:47] <Hixie> gsnedders: the problem with BNF or regexp is that they don't explain the error handling properly, usually. BNF could work for defining the author requirements in some cases, i guess, though i'm not convinced that would be better than prose, and, more importantly, once you have a BNF people are way too tempted to use it to define the parsing.
  577. # [22:47] <Hixie> DanC: see above also
  578. # [22:48] <Philip`> jgraham: Do you have local modifications? (SVN seems to have totally broken tokeniser-testing at the moment, so I assume you're not just using that)
  579. # [22:48] <gsnedders> Hixie: in the case of the common microsyntaxes the error handling is normally rather consistent though, and could be put simply
  580. # [22:48] <jgraham> Philip`: I'm using svn (my local modifications shouldn't affect this at all)
  581. # [22:48] <jgraham> Do you have simplejson installed?
  582. # [22:49] * DanC tunes in...
  583. # [22:49] <Philip`> jgraham: Also I put "ignoreErrorOrder":true on some tests where the error order is undefined and the test code should ignore differences
  584. # [22:49] <DanC> oh... BNF. never mind. whatever is convenient for the editor and reviewers is fine by me.
  585. # [22:49] <Philip`> (because the errors are emitted by the input stream, and nothing says when that actually occurs in relation to the token stream)
  586. # [22:50] <Philip`> I'm not sure if there's a better way to handle those cases
  587. # [22:50] <Philip`> (If it seems sensible, I can try to add support for ignoreErrorOrder into html5lib)
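
One way a test harness could honour "ignoreErrorOrder" is sketched below in Python. This is an assumption about how such a comparison might work, not html5lib's test code: the function names are invented, and the token shapes (`"ParseError"` strings, `["Character", data]` lists) follow the output quoted earlier in the log. It counts parse errors, joins adjacent character tokens, and compares the rest in order.

```python
def _split_errors(tokens):
    """Return (number of parse errors, other tokens with characters merged)."""
    errors = 0
    others = []
    for token in tokens:
        if token == "ParseError":
            errors += 1
        elif token[0] == "Character" and others and others[-1][0] == "Character":
            # An error may fall between two characters, so join the pieces.
            others[-1] = ["Character", others[-1][1] + token[1]]
        else:
            others.append(list(token))
    return errors, others

def tokens_match(expected, received, ignore_error_order=False):
    """Compare token lists, optionally ignoring where parse errors occur."""
    if not ignore_error_order:
        return expected == received
    return _split_errors(expected) == _split_errors(received)
```

This still checks the error count and the character data, which is the part of the result that is well defined.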
  588. # [22:51] <DanC> Hixie, are you still on holiday?
  589. # [22:51] <jgraham> What can you do then except count errors?
  590. # [22:51] <hsivonen> Philip`: I put the semicolon check in the wrong place...
  591. # [22:52] <Philip`> jgraham: About simplejson: I do have that installed, and html5lib appears to be importing it successfully
  592. # [22:52] <jgraham> Which version of simple json and which of python?
  593. # [22:53] * jgraham has simplejson 1.7.1 and python 2.5
  594. # [22:53] <hsivonen> it probably makes a difference whether you've got UTF-16 Python (OS X) or UTF-32 Python (Debian)
  595. # [22:54] <Hixie> DanC: yup
  596. # [22:54] <hsivonen> (making programs change meaning depending on how the interpreter was compiled is extremely bad idea, but that's the way Python is)
  597. # [22:55] <Hixie> DanC: 2 and a half more weeks, just checking in to keep the e-mail under control
  598. # [22:55] <hsivonen> s/extremely/an extremely/
  599. # [22:55] <DanC> ok. enjoy.
  600. # [22:55] <Philip`> jgraham: Counting errors and checking that the output characters are correct is still useful, e.g. I see "\r\u0000" being no parse error and "\n\u0000" (when it should have one parse error, but it doesn't matter whether it's before or between the characters)
  601. # [22:56] <Philip`> (*it should have one parse error and "\n\uFFFD")
  602. # [22:56] <jgraham> I guess
  603. # [22:56] <Philip`> s/being/being parsed by html5lib into/
  604. # [22:56] <hsivonen> should the JSON root name be different when the parse error semantics differ?
  605. # [22:57] <hsivonen> testsWithCountedErrors or somesuch
  606. # [22:57] <Philip`> Hmm, maybe I don't have simplejson
  607. # [22:57] * Quits: gsnedders (gsnedders@81.132.88.104) (Quit: gsnedders)
  608. # [22:58] <Philip`> Oh, yes I do
  609. # [22:58] <Philip`> version 1.7.1
  610. # [22:58] <hsivonen> Philip`: are you on debian?
  611. # [22:59] <Philip`> and Python 2.5.1
  612. # [22:59] <Philip`> Gentoo
  613. # [22:59] <Philip`> compiled without the "ucs2" option
  614. # [22:59] <Philip`> (That is, Python compiled without the "ucs2" option)
  615. # [22:59] <hsivonen> Philip`: that may be the problem right there
  616. # [23:00] <Philip`> For the cases like
  617. # [23:00] <Philip`> Expected:
  618. # [23:00] <Philip`> [[u'Character', u'\ud800\udc00']]
  619. # [23:00] <Philip`> Recieved:
  620. # [23:00] <Philip`> [[u'Character', u'\U00010000']]
  621. # [23:00] <Philip`> ?
  622. # [23:00] <hsivonen> Philip`: yes
  623. # [23:00] <Philip`> Sounds quite plausible
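
The two results above are the same character in different forms: U+10000 as one code point, or as the UTF-16 surrogate pair U+D800 U+DC00. A small Python sketch of the standard UTF-16 arithmetic follows. The function names are made up for the example, and it works on integers so it does not depend on how the interpreter was built.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP have surrogate pairs")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

A test harness could apply one of these to both sides before comparing, so that the expected and received strings agree whichever form the decoder produced.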
  624. # [23:01] * jgraham has Ubuntu which seems to have UCS4 python
  625. # [23:01] * Quits: ROBOd (robod@86.34.246.154) (Quit: http://www.robodesign.ro )
  626. # [23:02] * Joins: gsnedders (gsnedders@81.132.88.104)
  627. # [23:02] <Philip`> My C++ tokeniser will break unpleasantly under Windows because wchar_t is 2 bytes there, but I've just ignored that for now
  628. # [23:02] <hsivonen> this issue is the biggest Python WTF in my book
  629. # [23:03] * jgraham doesn't understand the issues well enough to have a useful opinion
  630. # [23:03] * Quits: gsnedders (gsnedders@81.132.88.104) (Quit: Don't touch /dev/null…)
  631. # [23:03] * Joins: gsnedders (gsnedders@81.132.88.104)
  632. # [23:03] <Philip`> At least I can just use std::basic_string<int32_t> and copy-and-paste some character-trait magic and then it should work with no other changes to my code
  633. # [23:03] <jgraham> But it is horrible that it's different on different installations
  634. # [23:06] <hsivonen> if I were doing general-purpose C++, I'd use either UTF-16 internally and ICU or UTF-8 and glib
  635. # [23:06] * Quits: briansuda (briansuda@85.220.95.76) (Quit: briansuda)
  636. # [23:06] * Joins: myakura (myakura@58.88.37.26)
  637. # [23:06] * Quits: myakura (myakura@58.88.37.26) (Quit: Leaving...)
  638. # [23:07] * hsivonen doesn't trust standard C++ lib strings
  639. # [23:09] <Philip`> Hmm, I suppose UTF-32 might be a bad idea if my code was actually doing something useful, instead of being purely streaming and never storing strings in memory for more than a few microseconds
  640. # [23:10] <Philip`> I've not seen STL strings doing anything other than act like an array of characters, so that doesn't seem to be a problem
  641. # [23:10] <Philip`> (though maybe I'm missing some issues somewhere)
  642. # [23:10] <mjs> WebKit uses UTF-16 internally throughout, since the DOM APIs are defined in terms of UTF-16
  643. # [23:11] <hsivonen> Philip`: more to the point, I don't trust wchar_t
  644. # [23:11] <Philip`> wchar_t is just an integer, of almost totally undefined size :-)
  645. # [23:11] <Philip`> which I suppose makes it not incredibly portable
  646. # [23:12] <hsivonen> mjs: your XML parser uses UTF-8 internally, right?
  647. # [23:12] <Philip`> but std::basic_string<uint16_t> should do the same everywhere
  648. # [23:12] <hsivonen> mjs: so there's a conversion every time?
  649. # [23:12] <hsivonen> Philip`: ok
  650. # [23:13] <mjs> hsivonen: yeah, for libxml it converts both ways every time
  651. # [23:13] <hsivonen> there seems to be a tendency that Microsoft, Apple, IBM, Mozilla and Sun like UTF-16 and Gnome likes UTF-8
  652. # [23:14] <hsivonen> UTF-16 is more corporate than UTF-8 :-)
  653. # [23:15] <mjs> UTF-16 is kind of sad
  654. # [23:15] <mjs> because it doesn't have the nice properties of either UTF-8 or UTF-32
  655. # [23:15] <hsivonen> mjs: yet, Debian/Ubuntu/Gentoo Python being not sad is more trouble than being consistently sad
  656. # [23:16] <Philip`> Has anyone done a UTF-21?
  657. # [23:16] <mjs> well, Python making the character set a compile time option is pretty ridiculous
  658. # [23:16] <Philip`> You'd get better space efficiency than UTF-32, and constant-time seeking to an arbitrary point in the string
  659. # [23:17] <Philip`> If we had 7-bit processors it'd even be nearly not stupid
  660. # [23:18] <hsivonen> Philip`: if you write an RFC, we'll have yet another encoding only useful for test cases as far as interchange over HTTP goes
  661. # [23:18] <mjs> UTF-24 would be slightly less silly
  662. # [23:18] <mjs> but still annoying, since unaligned access is expensive on most modern CPUs
  663. # [23:19] * Quits: gsnedders (gsnedders@81.132.88.104) (Quit: gsnedders)
  664. # [23:21] * Quits: mjs (mjs@64.81.48.145) (Quit: mjs)
  665. # [23:25] <DanC> is UTF-32 different from UCS-4 in any way?
  666. # [23:30] <hsivonen> DanC: I think there's a theoretical difference of max scalar value stored in the code unit
  667. # [23:31] <hsivonen> plus UTF-32 on disk or network is well-defined while UCS4 isn't, IIRC
  668. # [23:31] <DanC> hmm
  669. # [23:35] <hsivonen> DanC: but the practical difference is that UTF-32 is contemporary terminology while UCS-4 is old terminology. :-)
  670. # [23:35] * Joins: gsnedders (gsnedders@81.132.88.104)
  671. # [23:35] <DanC> ok. thanks.
  672. # [23:53] * Joins: hyatt (hyatt@17.203.14.212)
  673. # [23:57] * Quits: hyatt (hyatt@17.203.14.212) (Quit: hyatt)
  674. # [23:59] * Joins: hyatt (hyatt@17.203.14.212)
  675. # Session Close: Thu Jul 12 00:00:00 2007
