- # Session Start: Sun Nov 22 00:00:00 2009
- # Session Ident: #whatwg
- # [00:06] * Joins: jonpierce (n=jonpierc@64.119.130.114)
- # [00:20] * Parts: cpharmston (n=cpharmst@pool-173-66-156-203.washdc.fios.verizon.net)
- # [00:47] * Quits: gavin_ (n=gavin@firefox/developer/gavin) (Read error: 145 (Connection timed out))
- # [00:48] * Joins: gavin_ (n=gavin@firefox/developer/gavin)
- # [00:57] * Joins: nessy (n=Adium@203-214-159-50.dyn.iinet.net.au)
- # [01:04] * Joins: erlehmann (n=erlehman@1.106.113.82.net.de.o2.com)
- # [01:11] * Joins: dglazkov (n=dglazkov@c-67-188-0-62.hsd1.ca.comcast.net)
- # [01:13] * Quits: dglazkov (n=dglazkov@c-67-188-0-62.hsd1.ca.comcast.net) (Client Quit)
- # [01:22] * Quits: jonpierce (n=jonpierc@64.119.130.114)
- # [01:23] * Joins: MikeSmith (n=MikeSmit@EM114-48-25-149.pool.e-mobile.ne.jp)
- # [01:23] * Quits: MikeSmith (n=MikeSmit@EM114-48-25-149.pool.e-mobile.ne.jp) (Client Quit)
- # [01:25] * Joins: Jeromche (n=cellshoc@201.141.210.83)
- # [01:29] * Quits: Jeromche (n=cellshoc@201.141.210.83)
- # [01:43] * Joins: othermaciej (n=mjs@c-69-181-42-237.hsd1.ca.comcast.net)
- # [01:56] * Quits: tndH (n=Rob@cpc2-leed18-0-0-cust427.leed.cable.ntl.com) ("ChatZilla 0.9.85-rdmsoft [XULRunner 1.9.0.1/2008072406]")
- # [02:03] * Quits: archtech (i=stanv@83.228.56.37) (Client Quit)
- # [02:15] * Joins: Huvet (n=Emil@c-2fc1e555.07-131-73746f39.cust.bredbandsbolaget.se)
- # [02:16] <Huvet> hi everyone! I'm playing around with the html5lib 0.11 python implementation, and am wondering if I might have hit a bug: http://dpaste.com/hold/123513/
- # [02:16] <Huvet> I'm parsing the HTML of Swedish newspapers, which seems to be one of the worst messes in the world :(
- # [02:17] <Huvet> or, I could be doing something wrong, it would not be the first time :)
- # [02:27] * Quits: gavin_ (n=gavin@firefox/developer/gavin) (Read error: 110 (Connection timed out))
- # [02:28] * Joins: gavin_ (n=gavin@firefox/developer/gavin)
- # [02:30] * Quits: ttepasse (n=ttepas--@p5B014E4B.dip.t-dialin.net) ("?Q")
- # [02:42] <Huvet> the same error occurs on www.unt.se, and www.uhp.se too
- # [02:49] * Joins: gunderwonder (n=gunderwo@89.80-202-84.nextgentel.com)
- # [02:50] * Quits: paul_irish (n=paul_iri@64.119.130.114) (Remote closed the connection)
- # [02:53] * Joins: Arron (n=arronei@nat/microsoft/x-glkjiykrceibixcx)
- # [02:55] * Quits: othermaciej (n=mjs@c-69-181-42-237.hsd1.ca.comcast.net) (sendak.freenode.net irc.freenode.net)
- # [02:55] * Quits: Huvet (n=Emil@c-2fc1e555.07-131-73746f39.cust.bredbandsbolaget.se) (sendak.freenode.net irc.freenode.net)
- # [02:55] * Quits: arronei (n=arronei@nat/microsoft/x-pcqorwlngqmvpyfw) (sendak.freenode.net irc.freenode.net)
- # [02:57] * Joins: Huvet (n=Emil@c-2fc1e555.07-131-73746f39.cust.bredbandsbolaget.se)
- # [02:57] * Joins: othermaciej (n=mjs@c-69-181-42-237.hsd1.ca.comcast.net)
- # [02:59] * Joins: Huvet1 (n=Emil@c-2fc1e555.07-131-73746f39.cust.bredbandsbolaget.se)
- # [03:04] * Quits: Huvet1 (n=Emil@c-2fc1e555.07-131-73746f39.cust.bredbandsbolaget.se) ("Leaving.")
- # [03:10] * Quits: Huvet (n=Emil@c-2fc1e555.07-131-73746f39.cust.bredbandsbolaget.se) (Read error: 110 (Connection timed out))
- # [03:15] * Joins: jonpierce (n=jonpierc@209-6-91-231.c3-0.smr-ubr1.sbo-smr.ma.cable.rcn.com)
- # [03:24] * Quits: jonpierce (n=jonpierc@209-6-91-231.c3-0.smr-ubr1.sbo-smr.ma.cable.rcn.com)
- # [03:33] * Quits: othermaciej (n=mjs@c-69-181-42-237.hsd1.ca.comcast.net) (sendak.freenode.net irc.freenode.net)
- # [03:33] * Joins: othermaciej (n=mjs@c-69-181-42-237.hsd1.ca.comcast.net)
- # [03:47] * Joins: hobertoAtWork2 (n=hobertoa@gw1.mcgraw-hill.com)
- # [03:48] * Quits: hobertoAtWork (n=hobertoa@198.45.18.20) (Read error: 131 (Connection reset by peer))
- # [03:48] * Quits: TabAtkins (n=chatzill@70-139-15-246.lightspeed.rsbgtx.sbcglobal.net) (sendak.freenode.net irc.freenode.net)
- # [03:48] * Quits: ivan` (n=ivan@unaffiliated/ivan/x-000001) (sendak.freenode.net irc.freenode.net)
- # [03:48] * Quits: AryehGregor (n=Simetric@mediawiki/simetrical) (sendak.freenode.net irc.freenode.net)
- # [03:48] * Quits: jarib (i=jarib@li34-70.members.linode.com) (sendak.freenode.net irc.freenode.net)
- # [03:48] * Quits: vvv (n=vvv@mediawiki/VasilievVV) (sendak.freenode.net irc.freenode.net)
- # [03:48] * Quits: jgraham (n=jgraham@web22.webfaction.com) (sendak.freenode.net irc.freenode.net)
- # [03:49] * Joins: TabAtkins (n=chatzill@70-139-15-246.lightspeed.rsbgtx.sbcglobal.net)
- # [03:49] * Joins: ivan` (n=ivan@unaffiliated/ivan/x-000001)
- # [03:49] * Joins: AryehGregor (n=Simetric@mediawiki/simetrical)
- # [03:49] * Joins: jarib (i=jarib@li34-70.members.linode.com)
- # [03:49] * Joins: vvv (n=vvv@mediawiki/VasilievVV)
- # [03:49] * Joins: jgraham (n=jgraham@web22.webfaction.com)
- # [03:54] * Quits: Midler1 (n=midler@212.37.124.243) ("Leaving.")
- # [03:55] * Quits: ivan` (n=ivan@unaffiliated/ivan/x-000001) ("jumpin' jumpin'")
- # [03:55] * Joins: ivan` (n=ivan@unaffiliated/ivan/x-000001)
- # [03:55] * Joins: TabAtkins_ (n=chatzill@70-139-15-246.lightspeed.rsbgtx.sbcglobal.net)
- # [03:56] * Quits: jgraham (n=jgraham@web22.webfaction.com) (Read error: 131 (Connection reset by peer))
- # [03:56] * Joins: jgraham (n=jgraham@web22.webfaction.com)
- # [03:56] * Quits: TabAtkins (n=chatzill@70-139-15-246.lightspeed.rsbgtx.sbcglobal.net) (Read error: 131 (Connection reset by peer))
- # [03:56] * TabAtkins_ is now known as TabAtkins
- # [03:56] * Joins: jarib_ (i=jarib@li34-70.members.linode.com)
- # [03:56] * Quits: jarib (i=jarib@li34-70.members.linode.com) (Read error: 131 (Connection reset by peer))
- # [04:00] * Joins: cpharmston (n=cpharmst@pool-173-66-156-203.washdc.fios.verizon.net)
- # [04:02] * Quits: vvv (n=vvv@mediawiki/VasilievVV) (sendak.freenode.net irc.freenode.net)
- # [04:02] * Quits: AryehGregor (n=Simetric@mediawiki/simetrical) (sendak.freenode.net irc.freenode.net)
- # [04:04] * Joins: AryehGregor (n=Simetric@mediawiki/simetrical)
- # [04:04] * Joins: vvv (n=vvv@mediawiki/VasilievVV)
- # [04:08] * Quits: othermaciej (n=mjs@c-69-181-42-237.hsd1.ca.comcast.net) (sendak.freenode.net irc.freenode.net)
- # [04:09] * Joins: othermaciej (n=mjs@c-69-181-42-237.hsd1.ca.comcast.net)
- # [04:11] * Quits: othermaciej (n=mjs@c-69-181-42-237.hsd1.ca.comcast.net)
- # [04:12] * Quits: wm3|bed (n=davidwor@cpc3-bagu10-0-0-cust651.1-3.cable.virginmedia.com)
- # [04:14] * Quits: gunderwonder (n=gunderwo@89.80-202-84.nextgentel.com) (Read error: 110 (Connection timed out))
- # [04:15] * Joins: miketaylr (n=miketayl@24.42.95.234)
- # [04:15] * Quits: miketaylr (n=miketayl@24.42.95.234) (Remote closed the connection)
- # [04:15] * Joins: miketaylr (n=miketayl@24.42.95.234)
- # [04:26] * Joins: Lachy (n=Lachlan@85.196.122.246)
- # [04:40] * Parts: bentomas (n=bentomas@c-24-9-8-90.hsd1.co.comcast.net)
- # [04:43] * Joins: workmad3 (n=davidwor@cpc3-bagu10-0-0-cust651.1-3.cable.virginmedia.com)
- # [04:47] * Joins: dglazkov (n=dglazkov@c-67-188-0-62.hsd1.ca.comcast.net)
- # [04:55] * Quits: wakaba_0 (n=wakaba_@206.63.138.58.dy.bbexcite.jp) (Read error: 110 (Connection timed out))
- # [04:57] * Joins: wakaba_ (n=wakaba_@122x221x184x68.ap122.ftth.ucom.ne.jp)
- # [05:09] * Joins: riven` (n=colin@53518387.cable.casema.nl)
- # [05:12] * Quits: riven (n=colin@53518387.cable.casema.nl) (Connection reset by peer)
- # [05:12] * Joins: arronei (n=arronei@nat/microsoft/x-klwxenpiknmjwrct)
- # [05:19] * Joins: abii (n=macbook@rescomp-09-148450.Stanford.EDU)
- # [05:20] * Quits: Arron (n=arronei@nat/microsoft/x-glkjiykrceibixcx) (Read error: 110 (Connection timed out))
- # [05:47] * Quits: miketaylr (n=miketayl@24.42.95.234) ("Leaving...")
- # [06:21] * Joins: Dashimon (i=Dashiva@m223j.studby.ntnu.no)
- # [06:24] * Joins: miketaylr (n=miketayl@24.42.95.234)
- # [06:24] * Quits: miketaylr (n=miketayl@24.42.95.234) (Remote closed the connection)
- # [06:34] * Quits: cpharmston (n=cpharmst@pool-173-66-156-203.washdc.fios.verizon.net) ("Leaving.")
- # [06:38] * Quits: Dashiva (i=Dashiva@wikia/Dashiva) (Read error: 110 (Connection timed out))
- # [06:38] * Dashimon is now known as Dashiva
- # [06:52] * Joins: paul_irish (n=paul_iri@c-71-192-163-128.hsd1.nh.comcast.net)
- # [07:00] * Quits: gavin_ (n=gavin@firefox/developer/gavin) (Read error: 110 (Connection timed out))
- # [07:00] * Joins: gavin_ (n=gavin@firefox/developer/gavin)
- # [07:18] * Joins: MikeSmith (n=MikeSmit@EM114-48-9-94.pool.e-mobile.ne.jp)
- # [07:19] * Quits: dglazkov (n=dglazkov@c-67-188-0-62.hsd1.ca.comcast.net)
- # [07:19] * Quits: GPH-Laptop (n=GPHemsle@pdpc/supporter/student/GPHemsley) (Read error: 104 (Connection reset by peer))
- # [07:25] * Joins: harig (i=harig@121.245.103.44)
- # [07:42] * Joins: archtech (i=stanv@83.228.56.37)
- # [07:44] * Quits: harig (i=harig@121.245.103.44) (sendak.freenode.net irc.freenode.net)
- # [07:44] * Quits: Lachy (n=Lachlan@85.196.122.246) (sendak.freenode.net irc.freenode.net)
- # [07:44] * Quits: vvv (n=vvv@mediawiki/VasilievVV) (sendak.freenode.net irc.freenode.net)
- # [07:44] * Quits: AryehGregor (n=Simetric@mediawiki/simetrical) (sendak.freenode.net irc.freenode.net)
- # [07:45] * Joins: harig (i=harig@121.245.103.44)
- # [07:45] * Joins: Lachy (n=Lachlan@85.196.122.246)
- # [07:45] * Joins: AryehGregor (n=Simetric@mediawiki/simetrical)
- # [07:45] * Joins: vvv (n=vvv@mediawiki/VasilievVV)
- # [07:51] * Joins: jonpierce (n=jonpierc@209-6-91-231.c3-0.smr-ubr1.sbo-smr.ma.cable.rcn.com)
- # [07:54] * Quits: gavin_ (n=gavin@firefox/developer/gavin) (Read error: 110 (Connection timed out))
- # [07:55] * Joins: gavin_ (n=gavin@firefox/developer/gavin)
- # [08:06] * Quits: archtech (i=stanv@83.228.56.37) (Client Quit)
- # [08:15] * Quits: vvv (n=vvv@mediawiki/VasilievVV) ("KVIrc Insomnia 4.0.0, revision: 3410, sources date: 20090703, built on: 2009/08/12 22:29:13 UTC http://www.kvirc.net/")
- # [08:23] * Quits: jonpierce (n=jonpierc@209-6-91-231.c3-0.smr-ubr1.sbo-smr.ma.cable.rcn.com)
- # [08:42] * Quits: dbaron (n=dbaron@c-98-234-51-190.hsd1.ca.comcast.net) ("8403864 bytes have been tenured, next gc will be global.")
- # [09:14] * Joins: GPHemsley (n=GPHemsle@69.113.158.192)
- # [09:16] * Quits: harig (i=harig@121.245.103.44) (Read error: 145 (Connection timed out))
- # [09:40] * Joins: zcorpan (n=zcorpan@c83-252-193-59.bredband.comhem.se)
- # [09:47] * Joins: archtech (i=stanv@83.228.56.37)
- # [09:57] * Quits: zcorpan (n=zcorpan@c83-252-193-59.bredband.comhem.se) (Read error: 110 (Connection timed out))
- # [09:58] * Joins: zcorpan (n=zcorpan@c83-252-193-59.bredband.comhem.se)
- # [10:14] * Joins: Maurice (i=copyman@94.213.72.212)
- # [10:16] * Quits: zcorpan (n=zcorpan@c83-252-193-59.bredband.comhem.se) (Read error: 110 (Connection timed out))
- # [10:43] * Joins: ROBOd (n=robod@89.122.216.38)
- # [11:19] * Joins: svl (n=me@ip565744a7.direct-adsl.nl)
- # [11:26] * Joins: wakaba_0 (n=wakaba_@122x221x184x68.ap122.ftth.ucom.ne.jp)
- # [11:38] * Joins: Huvet (n=Emil@c-2fc1e555.07-131-73746f39.cust.bredbandsbolaget.se)
- # [11:38] * Quits: wakaba_ (n=wakaba_@122x221x184x68.ap122.ftth.ucom.ne.jp) (Read error: 110 (Connection timed out))
- # [11:40] * Joins: maikmerten (n=maikmert@77.132.12.215)
- # [12:05] * Quits: ciaran_lee (i=leecn@spoon.netsoc.tcd.ie) (Remote closed the connection)
- # [12:05] * Joins: ciaran_lee (i=leecn@134.226.83.42)
- # [12:09] * Quits: Rik|work (n=Rik|work@fw01d.skyrock.net) (Connection reset by peer)
- # [12:31] * Quits: nessy (n=Adium@203-214-159-50.dyn.iinet.net.au) ("Leaving.")
- # [12:46] * Joins: Rik` (n=Rik`@81.57.187.57)
- # [12:48] <Philip`> Huvet: 0.11 is very old - you should try it with the latest source version
- # [12:55] <Huvet> thanks, I will
- # [13:02] * Joins: Michelangelo (n=Michelan@93-42-96-106.ip86.fastwebnet.it)
- # [13:10] * Joins: mlpug (n=mlpug@a88-115-164-40.elisa-laajakaista.fi)
- # [13:20] * Joins: jonpierce (n=jonpierc@209.6.91.231)
- # [13:22] <Huvet> gah, "hg" needed to download the latest source version? what happened to the good old svn days :(
- # [13:25] <Philip`> The good old svn days turned into the better new hg days
- # [13:26] <Philip`> It's basically the same as SVN except you use the command "hg" instead of "svn" :-)
- # [13:26] * Quits: jonpierce (n=jonpierc@209.6.91.231)
- # [13:26] <Philip`> ...although I suppose it might be a bit more painful on Windows
- # [13:35] <Huvet> well, not really, seems to work exactly like it should
- # [13:37] <Huvet> hmm... strange, it checked out the whole tree, even though I requested a subdirectory
- # [13:38] * Quits: archtech (i=stanv@83.228.56.37) (No route to host)
- # [13:41] <Huvet> hmm... "... you cannot check out only one directory of a repository"
- # [13:41] * Joins: cpharmston (n=cpharmst@pool-173-66-156-203.washdc.fios.verizon.net)
- # [13:43] * Quits: MikeSmith (n=MikeSmit@EM114-48-9-94.pool.e-mobile.ne.jp) (Read error: 110 (Connection timed out))
- # [13:47] <Huvet> hmm... I guess I can't clone the default repository and use that? seems that is 0.11 still. Maybe the 0.2 branch? *figures out how to clone a branch*
- # [13:51] <Huvet> is that the latest version? or should I look into some other branch?
- # [13:54] <Huvet> ah, fuck it, beautifulsoup seems deprecated anyways
- # [13:55] <Philip`> Huvet: Yeah, Hg doesn't support partial checkouts - you just clone the entire repository
- # [13:55] <Huvet> yeah, I figured that out
- # [13:55] <Philip`> which includes all the branches and everything
- # [13:55] <Huvet> ah
- # [13:56] <Huvet> how do I know which the latest branch is?
- # [13:56] <Philip`> You should just use the default branch
- # [13:56] <Huvet> ok
- # [13:56] <Philip`> since the others were for temporary experiments
- # [13:57] <Philip`> I think the BS code is still included and should work better than the 0.11 release, though I could be wrong about that
- # [13:58] <Huvet> seems I still get the same error there
- # [13:58] <Philip`> but there are fundamental problems in BS that mean it can't work properly in html5lib, and nobody has been interested in spending a great deal of effort on it
- # [13:58] <Huvet> but with an extra DataLossWarning
- # [13:58] <Huvet> I'll just use something else then I guess
- # [13:59] <Philip`> Okay, so maybe it doesn't work much better than the 0.11 release :-(
- # [14:00] <Philip`> lxml is usually the recommended treebuilder
- # [14:01] <Huvet> ok, i saw the remark in the docs about lxml being an "excellent library" :)
- # [14:01] <Huvet> or something in those terms
- # [14:08] <Huvet> oh great, the lxml parser crashes on those sites too :(
- # [14:08] <Philip`> Hmm, seems to work okay for me with lxml
- # [14:09] <Philip`> (I can't test BS yet since I don't have it installed)
- # [14:09] <Huvet> are you parsing http://www.allehanda.se ?
- # [14:10] <Huvet> http://dpaste.com/123628/
- # [14:10] <Philip`> No, because that timed out when I first tried downloading it
- # [14:10] <Philip`> but now I see the problem :-/
- # [14:12] <Philip`> ihatexml.py lives up to its name
- # [14:12] <Huvet> heh, great name for a file, what does it do?
- # [14:12] * Joins: gratz|home (n=gratz@81.106.148.238)
- # [14:15] <Philip`> http://code.google.com/p/html5lib/issues/detail?id=125
- # [14:15] <Huvet> ah, that seems it
- # [14:15] <Philip`> It tries to modify the names returned by the HTML parser so they're compatible with APIs that enforce XML's name requirements
- # [14:16] <Philip`> (and similar things)
- # [14:20] <Philip`> Huvet: <a><div><div><a> seems to be the pattern the BS treebuilder dislikes
- # [14:21] <Huvet> heh, I can understand that
- # [14:23] <Philip`> Huvet: It's the same as http://code.google.com/p/html5lib/issues/detail?id=80
- # [14:23] <Huvet> ah, good detective work
- # [14:24] * Philip` should have remembered it sooner because he looked into that bug when it was new
- # [14:24] * Joins: harig (i=HariG@121.245.108.149)
- # [14:25] <Philip`> (At least that's the problem on www.unt.se, I assume the others are the same)
- # [14:26] * Joins: jonpierce (n=jonpierc@64.119.130.114)
- # [14:36] * Quits: hobertoAtWork2 (n=hobertoa@gw1.mcgraw-hill.com) (Read error: 104 (Connection reset by peer))
- # [14:36] * Joins: hobertoAtWork (n=hobertoa@gw1.mcgraw-hill.com)
- # [14:48] * Joins: MikeSmith (n=MikeSmit@114.49.0.152)
- # [15:00] * Quits: jonpierce (n=jonpierc@64.119.130.114)
- # [15:12] * Quits: JoePeck (n=JoePeck@cpe-74-69-85-249.rochester.res.rr.com)
- # [15:27] * Quits: harig (i=HariG@121.245.108.149) (Read error: 104 (Connection reset by peer))
- # [15:27] * Quits: danbri (n=danbri@unaffiliated/danbri) (Read error: 113 (No route to host))
- # [15:32] * Quits: gavin_ (n=gavin@firefox/developer/gavin) (Remote closed the connection)
- # [15:33] * Joins: gavin_ (n=gavin@firefox/developer/gavin)
- # [15:38] * Joins: jonpierce (n=jonpierc@64.119.130.114)
- # [15:38] * Joins: openstandards (n=openstan@78.143.215.162)
- # [15:50] * Joins: fishd_ (n=darin@c-98-207-16-168.hsd1.ca.comcast.net)
- # [15:59] * Joins: hobertoAtWork2 (n=hobertoa@gw2.mcgraw-hill.com)
- # [16:02] * Joins: danbri (n=danbri@unaffiliated/danbri)
- # [16:03] * Quits: erlehmann (n=erlehman@1.106.113.82.net.de.o2.com) ("Ex-Chat")
- # [16:05] * Joins: vvv (n=vvv@213.181.10.212)
- # [16:09] * Quits: sebmarkbage (n=miranda@213.80.108.29) (Remote closed the connection)
- # [16:10] * Joins: hobertoAtWork3 (n=hobertoa@gw1.mcgraw-hill.com)
- # [16:14] * Quits: hobertoAtWork (n=hobertoa@gw1.mcgraw-hill.com) (Read error: 110 (Connection timed out))
- # [16:15] * Joins: Phae (n=phaeness@cpc2-acto9-0-0-cust364.brnt.cable.ntl.com)
- # [16:21] * Joins: myakura (n=myakura@p2197-ipbf7505marunouchi.tokyo.ocn.ne.jp)
- # [16:26] * Quits: hobertoAtWork2 (n=hobertoa@gw2.mcgraw-hill.com) (Read error: 110 (Connection timed out))
- # [16:41] * Joins: boogyman (n=chatzill@unaffiliated/boogyman)
- # [16:51] * Joins: dglazkov (n=dglazkov@c-67-188-0-62.hsd1.ca.comcast.net)
- # [16:53] * Joins: sebmarkbage (n=miranda@213.80.108.29)
- # [16:54] * Quits: wakaba_0 (n=wakaba_@122x221x184x68.ap122.ftth.ucom.ne.jp) (Read error: 110 (Connection timed out))
- # [16:57] * Joins: KrocCamen (n=kroc@cpc3-lanc2-0-0-cust544.brig.cable.ntl.com)
- # [17:02] * Quits: fishd_ (n=darin@c-98-207-16-168.hsd1.ca.comcast.net) (Read error: 110 (Connection timed out))
- # [17:13] * Joins: taf2 (n=taf2@98.117.216.229)
- # [17:42] * Joins: JoePeck (n=JoePeck@cpe-74-65-7-212.rochester.res.rr.com)
- # [17:52] * Quits: paul_irish (n=paul_iri@c-71-192-163-128.hsd1.nh.comcast.net) (Remote closed the connection)
- # [17:55] * Quits: Phae (n=phaeness@cpc2-acto9-0-0-cust364.brnt.cable.ntl.com)
- # [17:57] * Quits: Michelangelo (n=Michelan@93-42-96-106.ip86.fastwebnet.it) (Remote closed the connection)
- # [18:06] * Quits: taf2 (n=taf2@98.117.216.229)
- # [18:23] * Joins: paul_irish (n=paul_iri@64.119.130.114)
- # [18:26] * jarib_ is now known as jarib
- # [18:27] * Joins: taf2 (n=taf2@151.196.60.88)
- # [18:31] * Quits: myakura (n=myakura@p2197-ipbf7505marunouchi.tokyo.ocn.ne.jp) ("Leaving...")
- # [18:45] * Joins: dbaron (n=dbaron@c-98-234-51-190.hsd1.ca.comcast.net)
- # [18:46] * Quits: jonpierce (n=jonpierc@64.119.130.114)
- # [19:03] * Joins: erlehmann (n=erlehman@1.106.113.82.net.de.o2.com)
- # [19:18] * Quits: starjive (i=beos@81-233-16-19-no30.tbcn.telia.com) (Read error: 110 (Connection timed out))
- # [19:21] * Joins: starjive (i=beos@81-233-16-19-no30.tbcn.telia.com)
- # [19:32] * Quits: Amorphous (i=jan@unaffiliated/amorphous) (Read error: 104 (Connection reset by peer))
- # [19:32] * Quits: MikeSmith (n=MikeSmit@114.49.0.152) (Read error: 145 (Connection timed out))
- # [19:34] * Quits: KrocCamen (n=kroc@cpc3-lanc2-0-0-cust544.brig.cable.ntl.com)
- # [19:37] * Joins: KrocCamen (n=kroc@cpc3-lanc2-0-0-cust544.brig.cable.ntl.com)
- # [19:39] * Joins: rauchg (n=rauchg@32.177.130.23)
- # [19:50] * Joins: cohitre (n=cohitre@64-40-56-46-dsl.itltd.net)
- # [19:50] * Parts: cohitre (n=cohitre@64-40-56-46-dsl.itltd.net)
- # [19:51] * Joins: Amorphous (i=jan@unaffiliated/amorphous)
- # [19:54] * Quits: starjive (i=beos@81-233-16-19-no30.tbcn.telia.com) (Read error: 110 (Connection timed out))
- # [19:57] * Quits: dglazkov (n=dglazkov@c-67-188-0-62.hsd1.ca.comcast.net) (Read error: 110 (Connection timed out))
- # [20:14] * boogyman is now known as boog|afk
- # [20:19] * Joins: fishd_ (n=darin@c-98-207-16-168.hsd1.ca.comcast.net)
- # [20:27] * Quits: KrocCamen (n=kroc@cpc3-lanc2-0-0-cust544.brig.cable.ntl.com)
- # [20:27] * Joins: KrocCamen (n=kroc@cpc3-lanc2-0-0-cust544.brig.cable.ntl.com)
- # [20:34] * Joins: zalan (n=zalan@catv-89-135-144-122.catv.broadband.hu)
- # [20:46] * Quits: maikmerten (n=maikmert@77.132.12.215) (Remote closed the connection)
- # [20:54] * Quits: fishd_ (n=darin@c-98-207-16-168.hsd1.ca.comcast.net) (Read error: 145 (Connection timed out))
- # [21:07] * Joins: jonpierce (n=jonpierc@64.119.130.114)
- # [21:23] * Quits: svl (n=me@ip565744a7.direct-adsl.nl) ("And back he spurred like a madman, shrieking a curse to the sky.")
- # [21:23] * Quits: ROBOd (n=robod@89.122.216.38) ("http://www.robodesign.ro")
- # [21:32] * Quits: taf2 (n=taf2@151.196.60.88) (Read error: 131 (Connection reset by peer))
- # [21:33] * Joins: taf2 (n=taf2@static-151-196-60-88.balt.east.verizon.net)
- # [21:34] * Joins: gunderwonder (n=gunderwo@89.80-202-84.nextgentel.com)
- # [21:35] * Joins: dglazkov (n=dglazkov@c-67-188-0-62.hsd1.ca.comcast.net)
- # [21:43] * Quits: KrocCamen (n=kroc@cpc3-lanc2-0-0-cust544.brig.cable.ntl.com)
- # [21:44] * Quits: dglazkov (n=dglazkov@c-67-188-0-62.hsd1.ca.comcast.net)
- # [21:47] * Joins: nessy (n=Adium@203-214-159-50.dyn.iinet.net.au)
- # [21:55] * Joins: dglazkov (n=dglazkov@c-67-188-0-62.hsd1.ca.comcast.net)
- # [21:58] * Quits: jonpierce (n=jonpierc@64.119.130.114)
- # [21:58] <Huvet> heh, next horrendous HTML that crashes the html5 parser: http://7-harad.nu/
- # [21:59] <Philip`> What error message do you get?
- # [21:59] <Huvet> http://dpaste.com/123783/
- # [21:59] * Joins: taf2_ (n=taf2@static-151-196-60-88.balt.east.verizon.net)
- # [22:00] * Quits: cpharmston (n=cpharmst@pool-173-66-156-203.washdc.fios.verizon.net) ("Leaving.")
- # [22:02] <Philip`> Hmm
- # [22:02] <Philip`> What treebuilder are you using?
- # [22:02] <Huvet> dom
- # [22:03] <Huvet> beautifulsoup crashed on some sites, lxml on some other ones, so I'm on dom now :)
- # [22:04] <Huvet> I guess it's all the advertising code on these sites that make them so badly formatted
- # [22:04] <Philip`> http://code.google.com/p/html5lib/issues/detail?id=123 sounds like it could be relevant
- # [22:05] <Philip`> but I'm not really sure
- # [22:05] <Philip`> It'd be good if you could produce a minimal testcase
- # [22:05] <Huvet> yeah, I'm not sure how to go about that... save the source code locally and start stripping stuff out?
- # [22:05] <Philip`> by starting with the markup from the site that causes problems, then deleting half of it and seeing if the problem is still there, else delete the other half instead, and repeat until there's not much left
- # [22:06] <Philip`> Yeah, basically what you said :-)
- # [22:06] <Huvet> ok, I'll get to work right away
- # [22:06] <Huvet> :)
- # [22:07] * Joins: jonpierce (n=jonpierc@64.119.130.114)
- # [22:08] * Quits: taf2 (n=taf2@static-151-196-60-88.balt.east.verizon.net) (Read error: 110 (Connection timed out))
- # [22:21] * Quits: taf2_ (n=taf2@static-151-196-60-88.balt.east.verizon.net) (Read error: 110 (Connection timed out))
- # [22:26] * Joins: KrocCamen (n=kroc@cpc3-lanc2-0-0-cust544.brig.cable.ntl.com)
- # [22:27] <Huvet> oh, there's a new error
- # [22:27] <Huvet> http://dpaste.com/123797/
- # [22:28] <Huvet> but one thing at a time
- # [22:30] <Philip`> Testing on real content is a good way to find bugs :-)
- # [22:31] * Philip` wonders how many pages Huvet is running through it
- # [22:31] <Huvet> 351 :)
- # [22:31] <Huvet> I'm scraping Swedish news sites for RSS urls
- # [22:32] <Huvet> seems that's a bit harder than I first thought :P
- # [22:33] <Huvet> this is the smallest I can get it: <table><td><span><font></span><span>
- # [22:33] <Huvet> first one
- # [22:33] * Quits: workmad3 (n=davidwor@cpc3-bagu10-0-0-cust651.1-3.cable.virginmedia.com)
- # [22:35] <Huvet> ehm... strange... the other error is if I have a file with just <table> in it :)
- # [22:36] * Quits: dglazkov (n=dglazkov@c-67-188-0-62.hsd1.ca.comcast.net) (Read error: 145 (Connection timed out))
- # [22:39] <Philip`> That's quite minimal :-)
- # [22:39] * Joins: dglazkov (n=dglazkov@c-67-188-0-62.hsd1.ca.comcast.net)
- # [22:39] * Quits: mlpug (n=mlpug@a88-115-164-40.elisa-laajakaista.fi) (Remote closed the connection)
- # [22:43] <Philip`> Huvet: I think you could fix the processEOF easily by removing the 'token' in html5parser.py lines 1689, 1692 (the processEOF declaration/call)
- # [22:44] <Philip`> but it'd be good to post a new issue on the Google Code site, so someone can add a test case and fix the code and make sure it works
- # [22:45] <Philip`> and also for the other bug (which looks like a scary adoption agency thing)
- # [22:46] <Huvet> I will
- # [22:48] * Joins: tndH (n=Rob@cpc2-leed18-0-0-cust427.leed.cable.ntl.com)
- # [23:02] * riven` is now known as riven
- # [23:04] <Huvet> here's the first bug: http://code.google.com/p/html5lib/issues/detail?id=126
- # [23:05] * Quits: KrocCamen (n=kroc@cpc3-lanc2-0-0-cust544.brig.cable.ntl.com)
- # [23:09] * Joins: cpharmston (n=cpharmst@pool-173-66-156-203.washdc.fios.verizon.net)
- # [23:09] <Huvet> and here's the other one: http://code.google.com/p/html5lib/issues/detail?id=127
- # [23:11] * Joins: ttepasse (n=ttepas--@dslb-084-060-060-034.pools.arcor-ip.net)
- # [23:11] * Parts: cpharmston (n=cpharmst@pool-173-66-156-203.washdc.fios.verizon.net)
- # [23:12] <AryehGregor> "Such a subset does not, in general, include inline script elements."
- # [23:12] * Quits: rauchg (n=rauchg@32.177.130.23) (Read error: 110 (Connection timed out))
- # [23:12] <AryehGregor> Why can't you include inline script in polyglots? Can't you fudge things using <![CDATA[ or whatnot?
- # [23:15] * Quits: dglazkov (n=dglazkov@c-67-188-0-62.hsd1.ca.comcast.net)
- # [23:22] * Quits: Maurice (i=copyman@94.213.72.212)
- # [23:31] * Joins: fishd_ (n=darin@c-98-207-16-168.hsd1.ca.comcast.net)
- # [23:45] * Quits: fishd_ (n=darin@c-98-207-16-168.hsd1.ca.comcast.net) (Read error: 145 (Connection timed out))
- # [23:52] * Quits: zalan (n=zalan@catv-89-135-144-122.catv.broadband.hu) (Read error: 110 (Connection timed out))
- # [23:57] * Quits: dbaron (n=dbaron@c-98-234-51-190.hsd1.ca.comcast.net) ("8403864 bytes have been tenured, next gc will be global.")
- # Session Close: Mon Nov 23 00:00:00 2009
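
The halving recipe Philip` describes at [22:05] (delete half of the markup, check whether the problem is still there, otherwise try the other half, and repeat) can be sketched roughly as follows. This is only an illustration under stated assumptions: `reduce_testcase` and the `still_fails` callback are hypothetical names that do not come from the log or from html5lib, and the sketch splits by characters rather than by tags.

```python
from typing import Callable


def reduce_testcase(markup: str, still_fails: Callable[[str], bool]) -> str:
    """Shrink `markup` while `still_fails(markup)` stays true.

    `still_fails` is a hypothetical callback supplied by the caller; it
    should return True when the given markup still triggers the problem.
    """
    if not still_fails(markup):
        raise ValueError("the starting markup does not trigger the failure")
    while len(markup) > 1:
        middle = len(markup) // 2
        first, second = markup[:middle], markup[middle:]
        if still_fails(first):
            # The first half alone reproduces the problem; keep only that.
            markup = first
        elif still_fails(second):
            # Otherwise try the other half instead.
            markup = second
        else:
            # The failure needs material from both halves, so stop here
            # and continue trimming by hand.
            break
    return markup
```

In practice `still_fails` would probably wrap a call to the parser in a `try`/`except` and report whether the same exception is raised. Because the split is by character, it can cut a tag in two; when neither half fails on its own the function returns what it has, and the remaining trimming is manual, as it was for the `<table><td><span><font></span><span>` case at [22:33].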
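
At [14:15] Philip` says that ihatexml.py modifies the names returned by the HTML parser so that they are compatible with APIs that enforce XML's name requirements. The sketch below shows the general idea only; it is not html5lib's code. The function `coerce_name`, the ASCII-only character rules, and the `U` plus hex escape format are all assumptions made for illustration, and the real XML name grammar permits many more characters (including the colon, which is excluded here for simplicity).

```python
import re

# Simplified, ASCII-only approximation of XML's name rules (an assumption
# made for brevity; the real XML grammar allows many non-ASCII characters).
_INVALID_START = re.compile(r"[^A-Za-z_]")
_INVALID_REST = re.compile(r"[^A-Za-z0-9_.\-]")


def _escape(char: str) -> str:
    # Hypothetical escape format: "U" followed by the code point in hex.
    return "U%05X" % ord(char)


def coerce_name(name: str) -> str:
    """Return `name` rewritten so it fits the simplified name rules above."""
    if not name:
        raise ValueError("a name must not be empty")
    head, tail = name[0], name[1:]
    if _INVALID_START.match(head):
        # The first character has stricter rules than the rest.
        head = _escape(head)
    tail = _INVALID_REST.sub(lambda m: _escape(m.group(0)), tail)
    return head + tail
```

Names that already fit the rules, such as `div` or `data-x`, pass through unchanged, while something like `a"b` has the quote replaced by an escape sequence. Real-world markup can contain element and attribute names that an XML-oriented tree API would reject, which is why a coercion step of this kind sits between the parser and the tree.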