- # Session Start: Wed May 21 00:00:00 2014
- # Session Ident: #webplatform
- # [00:00] * Quits: @julee (~Adium@192.150.10.210) (Quit: Leaving.)
- # [00:03] * Joins: julee (~Adium@192.150.10.210)
- # [00:03] * ChanServ sets mode: +o julee
- # [00:17] * Quits: drublic (~drublic@xdsl-87-78-102-91.netcologne.de) (Remote host closed the connection)
- # [00:27] * Quits: +eliezerb (uid25062@gateway/web/irccloud.com/x-mtgsqbxbuesaebcd) (Quit: Connection closed for inactivity)
- # [00:30] * Joins: kiy (~kiyoura@pool-173-79-97-128.washdc.fios.verizon.net)
- # [00:38] * Quits: AmeliaBR (3263c548@gateway/web/freenode/ip.50.99.197.72) (Quit: Page closed)
- # [00:55] * Quits: roven (~roven@78-20-24-80.access.telenet.be) (Remote host closed the connection)
- # [01:04] * Quits: David_Bradbury (~chatzilla@75-147-178-254-Washington.hfc.comcastbusiness.net) (Quit: ChatZilla 0.9.90.1 [Firefox 29.0.1/20140506152807])
- # [01:05] * Quits: @julee (~Adium@192.150.10.210) (Quit: Leaving.)
- # [01:42] * Joins: julee (~Adium@192.150.10.210)
- # [01:42] * ChanServ sets mode: +o julee
- # [01:46] * Quits: @julee (~Adium@192.150.10.210) (Ping timeout: 252 seconds)
- # [01:48] * Quits: @Ryan_Lane (~Ryan_Lane@wikimedia/Ryan-lane) (Quit: Leaving.)
- # [01:52] * Joins: Ryan_Lane (~Ryan_Lane@wikimedia/Ryan-lane)
- # [01:52] * ChanServ sets mode: +o Ryan_Lane
- # [01:55] * Joins: ryuan (~ryuan@210.94.41.89)
- # [01:59] * Quits: lmclister (~lmclister@192.150.10.210)
- # [01:59] * Quits: @Ryan_Lane (~Ryan_Lane@wikimedia/Ryan-lane) (Quit: Leaving.)
- # [02:00] * Joins: Ryan_Lane (~Ryan_Lane@wikimedia/Ryan-lane)
- # [02:00] * ChanServ sets mode: +o Ryan_Lane
- # [02:04] * DenSchub is now known as offSchub
- # [02:34] * Quits: jswisher (~jswisher@cpe-72-182-94-57.austin.res.rr.com) (Quit: jswisher)
- # [02:53] * Quits: @Ryan_Lane (~Ryan_Lane@wikimedia/Ryan-lane) (Quit: Leaving.)
- # [02:56] * Joins: roven (~roven@78-20-24-80.access.telenet.be)
- # [02:57] * Joins: lmclister (~lmclister@c-98-210-38-110.hsd1.ca.comcast.net)
- # [03:01] * Quits: roven (~roven@78-20-24-80.access.telenet.be) (Ping timeout: 258 seconds)
- # [03:14] * Joins: karlcow (~karl@nerval.la-grange.net)
- # [03:59] * Quits: lmclister (~lmclister@c-98-210-38-110.hsd1.ca.comcast.net)
- # [04:04] * Joins: eliezerb (uid25062@gateway/web/irccloud.com/x-mkwuhgrzqoqbgdzr)
- # [04:04] * ChanServ sets mode: +v eliezerb
- # [04:05] <+eliezerb> renoirb: wow! No more crazy jobs! \o/
- # [04:06] * Quits: karlcow (~karl@nerval.la-grange.net) (Quit: This computer has gone to sleep)
- # [04:11] * Quits: vanessametonini (~vanessame@5.55.net.registro.br) (Remote host closed the connection)
- # [04:38] * Joins: lmclister (~lmclister@c-98-210-38-110.hsd1.ca.comcast.net)
- # [04:57] * Joins: roven (~roven@78-20-24-80.access.telenet.be)
- # [05:01] * Quits: roven (~roven@78-20-24-80.access.telenet.be) (Ping timeout: 240 seconds)
- # [05:02] * Joins: karlcow (~karl@nerval.la-grange.net)
- # [05:03] * Quits: karlcow (~karl@nerval.la-grange.net) (Remote host closed the connection)
- # [05:03] * Joins: karlcow (~karl@nerval.la-grange.net)
- # [05:10] * Quits: Bad_Advice_Cat (~Moai@unaffiliated/featheredserpent) (Ping timeout: 256 seconds)
- # [05:24] * Joins: hyperair (~hyperair@ubuntu/member/hyperair)
- # [05:24] * Quits: ckwalsh (~ckwalsh@facebook/engineering/ckwalsh) (Remote host closed the connection)
- # [05:38] * Quits: hyperair (~hyperair@ubuntu/member/hyperair) (Ping timeout: 255 seconds)
- # [05:43] * Joins: Bad_Advice_Cat (~Moai@unaffiliated/featheredserpent)
- # [05:47] * Joins: hyperair (~hyperair@ubuntu/member/hyperair)
- # [05:52] * Joins: java_expert (ba52dc30@gateway/web/freenode/ip.186.82.220.48)
- # [05:52] <java_expert> hello
- # [05:53] <java_expert> hola [Spanish: "hello"]
- # [05:53] <java_expert> hola
- # [05:53] <java_expert> hola
- # [05:53] <java_expert> ninguno por ahi [Spanish: "nobody around"]
- # [05:58] * Quits: java_expert (ba52dc30@gateway/web/freenode/ip.186.82.220.48) (Ping timeout: 240 seconds)
- # [06:01] * Quits: jerryitt (uid17132@gateway/web/irccloud.com/x-iqrondqcnxdhkcfl) (Quit: Connection closed for inactivity)
- # [06:46] * Quits: Rastus_Vernon (uid15187@wikimedia/Rastus-Vernon) (Quit: Connection closed for inactivity)
- # [06:49] * Quits: benschwarz_ (sid2121@gateway/web/irccloud.com/x-hrrpuokenukxdomr) (Ping timeout: 276 seconds)
- # [06:50] * Joins: benschwarz_ (sid2121@gateway/web/irccloud.com/x-lwfgtktwfrrisvfb)
- # [07:47] * Quits: +eliezerb (uid25062@gateway/web/irccloud.com/x-mkwuhgrzqoqbgdzr) (Quit: Connection closed for inactivity)
- # [07:49] * Quits: lmclister (~lmclister@c-98-210-38-110.hsd1.ca.comcast.net)
- # [08:09] * Joins: ptressel (~chatzilla@174-31-242-8.tukw.qwest.net)
- # [08:30] * Joins: lmclister (~lmclister@c-98-210-38-110.hsd1.ca.comcast.net)
- # [08:40] * Quits: kiy (~kiyoura@pool-173-79-97-128.washdc.fios.verizon.net) (Read error: Connection reset by peer)
- # [08:40] * Quits: karlcow (~karl@nerval.la-grange.net) (Quit: :tiuQ tiuq sah woclrak)
- # [08:40] * Joins: karlcow (~karl@nerval.la-grange.net)
- # [08:55] * Quits: lmclister (~lmclister@c-98-210-38-110.hsd1.ca.comcast.net)
- # [08:58] * Joins: roven (~roven@78-20-24-80.access.telenet.be)
- # [09:04] * Quits: roven (~roven@78-20-24-80.access.telenet.be) (Ping timeout: 252 seconds)
- # [09:04] * Quits: @_cheney (~cheney@nat.sierrabravo.net) (Read error: Connection reset by peer)
- # [09:05] * Joins: _cheney (~cheney@nat.sierrabravo.net)
- # [09:05] * ChanServ sets mode: +o _cheney
- # [09:16] * Joins: mattweb_de (~mattweb_d@pd95699f8.dip0.t-ipconnect.de)
- # [09:19] * Joins: drublic (~drublic@213.15.0.85)
- # [09:24] * Joins: antdillon (~ant@nat/canonical/x-rporkjeklrourlrj)
- # [09:42] * Joins: roven (~roven@78-20-24-80.access.telenet.be)
- # [10:03] * Joins: mattweb_de_ (~mattweb_d@pd95699f8.dip0.t-ipconnect.de)
- # [10:05] * Quits: mattweb_de (~mattweb_d@pd95699f8.dip0.t-ipconnect.de) (Ping timeout: 276 seconds)
- # [10:05] * mattweb_de_ is now known as mattweb_de
- # [10:08] * Quits: ptressel (~chatzilla@174-31-242-8.tukw.qwest.net) (Read error: Connection reset by peer)
- # [10:16] * Joins: mstalfoort (~manuchill@83.232.96.217)
- # [10:22] * Quits: ryuan (~ryuan@210.94.41.89) (Remote host closed the connection)
- # [10:51] * Joins: auchenberg (~auchenber@94.18.214.22)
- # [11:15] * Joins: ink|off|ZNC (~inky@master.qs.biz)
- # [11:37] * Quits: tfnico (sid1523@gateway/web/irccloud.com/x-isyyvsgckilbufud) (Ping timeout: 245 seconds)
- # [11:37] * Joins: ptressel (~chatzilla@174-31-242-8.tukw.qwest.net)
- # [11:37] * Quits: Kenzi` (sid7017@gateway/web/irccloud.com/x-vdyzvhoiidhywuit) (Ping timeout: 245 seconds)
- # [11:37] * Quits: benschwarz_ (sid2121@gateway/web/irccloud.com/x-lwfgtktwfrrisvfb) (Read error: Connection reset by peer)
- # [11:37] * Joins: benschwarz_ (sid2121@gateway/web/irccloud.com/x-xhaeqxquxpiokswu)
- # [11:39] * Joins: Kenzi` (sid7017@gateway/web/irccloud.com/x-tssfcjiofrpvegrz)
- # [11:39] * Joins: tfnico (sid1523@gateway/web/irccloud.com/x-kroncydhqvfwbyve)
- # [11:44] * Quits: wpdbot (~wpdbot@ec2-50-19-180-183.compute-1.amazonaws.com) (Remote host closed the connection)
- # [11:45] * Joins: wpdbot (~wpdbot@ec2-23-22-142-26.compute-1.amazonaws.com)
- # [11:47] * Quits: tfnico (sid1523@gateway/web/irccloud.com/x-kroncydhqvfwbyve) (Ping timeout: 264 seconds)
- # [11:47] * Joins: tfnico (sid1523@gateway/web/irccloud.com/x-zpgtpiidwiydfmmc)
- # [11:49] * Quits: Bad_Advice_Cat (~Moai@unaffiliated/featheredserpent) (Ping timeout: 256 seconds)
- # [11:51] * Joins: Bad_Advice_Cat (~Moai@unaffiliated/featheredserpent)
- # [12:24] * Quits: auchenberg (~auchenber@94.18.214.22) (Remote host closed the connection)
- # [12:24] * Quits: ptressel (~chatzilla@174-31-242-8.tukw.qwest.net) (Quit: zzz)
- # [12:29] * Joins: chrismills (~chrismill@87.115.156.125)
- # [12:29] * ChanServ sets mode: +o chrismills
- # [12:35] * Joins: eliezerb (uid25062@gateway/web/irccloud.com/x-kswgdusrcmyfeomq)
- # [12:35] * ChanServ sets mode: +v eliezerb
- # [12:50] * Joins: auchenberg (~auchenber@94.18.214.22)
- # [13:50] * Joins: auchenbe_ (~auchenber@94.18.214.22)
- # [13:50] * Quits: auchenberg (~auchenber@94.18.214.22) (Read error: Connection reset by peer)
- # [13:54] * Quits: Bad_Advice_Cat (~Moai@unaffiliated/featheredserpent) (Ping timeout: 256 seconds)
- # [15:13] * Joins: jswisher (~jswisher@cpe-72-182-94-57.austin.res.rr.com)
- # [15:45] * Joins: jerryitt (uid17132@gateway/web/irccloud.com/x-jidigtgjolflkaps)
- # [15:47] * Quits: auchenbe_ (~auchenber@94.18.214.22) (Remote host closed the connection)
- # [15:52] * Joins: auchenberg (~auchenber@176.222.239.226)
- # [16:10] * Joins: Ryan_Lane (~Ryan_Lane@wikimedia/Ryan-lane)
- # [16:10] * ChanServ sets mode: +o Ryan_Lane
- # [16:12] * Quits: @Ryan_Lane (~Ryan_Lane@wikimedia/Ryan-lane) (Client Quit)
- # [16:17] * Quits: +eliezerb (uid25062@gateway/web/irccloud.com/x-kswgdusrcmyfeomq) (Quit: Connection closed for inactivity)
- # [16:35] * Joins: dontcallmedom (~dom@216.239.55.62)
- # [16:41] * Joins: codylindley (~textual@184-155-250-216.cpe.cableone.net)
- # [17:07] * Quits: auchenberg (~auchenber@176.222.239.226) (Remote host closed the connection)
- # [17:08] * Quits: hyperair (~hyperair@ubuntu/member/hyperair) (Ping timeout: 240 seconds)
- # [17:10] * Joins: auchenberg (~auchenber@176.222.239.226)
- # [17:11] * Quits: jswisher (~jswisher@cpe-72-182-94-57.austin.res.rr.com) (Ping timeout: 264 seconds)
- # [17:12] * Joins: auchenbe_ (~auchenber@176.222.239.226)
- # [17:15] * Quits: auchenberg (~auchenber@176.222.239.226) (Ping timeout: 276 seconds)
- # [17:26] * Joins: jswisher (~jswisher@cpe-72-182-94-57.austin.res.rr.com)
- # [17:45] * Joins: eliezerb (uid25062@gateway/web/irccloud.com/x-wlnlnvjxzgefidtp)
- # [17:45] * ChanServ sets mode: +v eliezerb
- # [17:46] * Quits: auchenbe_ (~auchenber@176.222.239.226) (Remote host closed the connection)
- # [17:47] * Quits: mattweb_de (~mattweb_d@pd95699f8.dip0.t-ipconnect.de) (Quit: mattweb_de)
- # [17:52] * Quits: karlcow (~karl@nerval.la-grange.net) (Ping timeout: 258 seconds)
- # [17:54] * Joins: karlcow (~karl@nerval.la-grange.net)
- # [17:57] * Quits: jswisher (~jswisher@cpe-72-182-94-57.austin.res.rr.com) (Quit: jswisher)
- # [17:59] * Joins: hyperair (~hyperair@ubuntu/member/hyperair)
- # [18:02] * Joins: lmclister (~lmclister@192.150.10.210)
- # [18:07] * Quits: drublic (~drublic@213.15.0.85) (Remote host closed the connection)
- # [18:11] * Quits: mstalfoort (~manuchill@83.232.96.217) (Quit: kthxbai)
- # [18:19] * Joins: julee (~Adium@c-50-184-87-81.hsd1.ca.comcast.net)
- # [18:19] * ChanServ sets mode: +o julee
- # [18:19] * Quits: @julee (~Adium@c-50-184-87-81.hsd1.ca.comcast.net) (Client Quit)
- # [18:21] * Joins: julee (~Adium@192.150.10.203)
- # [18:21] * ChanServ sets mode: +o julee
- # [18:43] * Quits: @chrismills (~chrismill@87.115.156.125) (Quit: Off to find beer and rock and roll...)
- # [18:59] * Quits: antdillon (~ant@nat/canonical/x-rporkjeklrourlrj) (Quit: Leaving)
- # [19:16] * Quits: lmclister (~lmclister@192.150.10.210)
- # [19:21] * Joins: lmclister (~lmclister@192.150.10.210)
- # [19:27] * Joins: David_Bradbury (~chatzilla@75-147-178-254-Washington.hfc.comcastbusiness.net)
- # [19:33] * Joins: auchenberg (~auchenber@x1-6-00-8e-f2-36-28-8a.cpe.webspeed.dk)
- # [19:47] * Joins: Bad_Advice_Cat (~Moai@unaffiliated/featheredserpent)
- # [19:55] * Joins: ckwalsh (~ckwalsh@facebook/engineering/ckwalsh)
- # [20:11] * offSchub is now known as DenSchub
- # [20:34] * Parts: ink|off|ZNC (~inky@master.qs.biz)
- # [20:57] * Quits: David_Bradbury (~chatzilla@75-147-178-254-Washington.hfc.comcastbusiness.net) (Quit: ChatZilla 0.9.90.1 [Firefox 29.0.1/20140506152807])
- # [21:03] * Quits: karlcow (~karl@nerval.la-grange.net) (Ping timeout: 240 seconds)
- # [21:05] * Joins: _cheney_ (~cheney@nat.sierrabravo.net)
- # [21:07] * Quits: dontcallmedom (~dom@216.239.55.62) (Ping timeout: 240 seconds)
- # [21:08] * Quits: @_cheney (~cheney@nat.sierrabravo.net) (Ping timeout: 240 seconds)
- # [21:09] * Joins: mattweb_de (~mattweb_d@cable-78-34-4-198.netcologne.de)
- # [21:56] * Quits: m4nu (~manu@216.252.204.51) (Ping timeout: 276 seconds)
- # [21:58] * Joins: ptressel (~chatzilla@174-31-242-8.tukw.qwest.net)
- # [22:01] * Joins: manu (~manu@216.252.204.51)
- # [22:01] * manu is now known as Guest41481
- # [22:02] * Guest41481 is now known as m4nu
- # [22:03] * Quits: auchenberg (~auchenber@x1-6-00-8e-f2-36-28-8a.cpe.webspeed.dk) (Remote host closed the connection)
- # [22:04] * Joins: auchenberg (~auchenber@x1-6-00-8e-f2-36-28-8a.cpe.webspeed.dk)
- # [22:09] * Quits: auchenberg (~auchenber@x1-6-00-8e-f2-36-28-8a.cpe.webspeed.dk) (Ping timeout: 265 seconds)
- # [22:11] * Joins: auchenberg (~auchenber@x1-6-00-8e-f2-36-28-8a.cpe.webspeed.dk)
- # [22:16] * Quits: auchenberg (~auchenber@x1-6-00-8e-f2-36-28-8a.cpe.webspeed.dk) (Ping timeout: 265 seconds)
- # [22:17] * Quits: +eliezerb (uid25062@gateway/web/irccloud.com/x-wlnlnvjxzgefidtp) (Quit: Connection closed for inactivity)
- # [22:24] <@shepazu> frozenice, yt?
- # [22:25] <@frozenice> hi!
- # [22:25] <@shepazu> whoah!
- # [22:25] <@shepazu> fast response
- # [22:25] <@frozenice> working on the irc bot :)
- # [22:25] <@shepazu> now I don't recall what I wanted to say… jk
- # [22:25] <@frozenice> it's about MDN, I imagine
- # [22:25] <@shepazu> frozenice, any chance you could help with the MDN crawling/scraping?
- # [22:26] <@shepazu> Pat seems to be busy lately, or maybe I'm just confused
- # [22:26] <@shepazu> in any case, I'm stressing out about the compat-table stuff
- # [22:27] <@shepazu> and from what I understood, you had some of the scrape-bot stuff working already
- # [22:27] <@shepazu> in your NodeJS thingie
- # [22:28] <@frozenice> yeah that works, it fetches feeds from some tags (HTML, HTML5, CSS, etc.) but it can only get 500 pages or so from those feeds, that was the problem
- # [22:28] <@shepazu> frozenice, any way around that?
- # [22:29] <@frozenice> none that I saw, we somehow need to get us a list of all the pages, then the bot can run with that and pick out the compat tables from each page
- # [22:30] <@frozenice> I just did the feed stuff to get a pool of useful pages
- # [22:30] <@shepazu> frozenice, that's the crawler aspect, right?
- # [22:30] <@shepazu> surely there's a node crawler out there...
- # [22:30] <@frozenice> well, kinda
- # [22:31] <@frozenice> we just need a list of pages, the thing can do the rest :)
- # [22:31] <@frozenice> maybe MDN has one, in a sitemap or something
- # [22:31] * Joins: auchenberg (~auchenber@x1-6-00-8e-f2-36-28-8a.cpe.webspeed.dk)
- # [22:31] <@shepazu> frozenice, I'm sure they do
- # [22:32] <@shepazu> frozenice, if I collect the list of pages, will your scraper do the rest?
- # [22:32] <@frozenice> yup
- # [22:32] <@frozenice> the "getting a list of pages" is just one step, we can change how it gets that list
- # [22:33] <@shepazu> frozenice, ok, I'll get on that
- # [22:33] <@frozenice> neat
- # [22:33] <@frozenice> btw, I started working on the irc bot again yesterday, making progress :)
- # [22:33] * Joins: drublic (~drublic@xdsl-87-78-27-195.netcologne.de)
- # [22:34] <@shepazu> frozenice, what will it do?
- # [22:34] <@frozenice> everything!
- # [22:34] <@shepazu> and what specific part are you working on?
- # [22:34] <@frozenice> everything!
- # [22:34] <@shepazu> everything? good, I could use someone to help clean up my house
- # [22:34] <@frozenice> your house is no part of the bot :P
- # [22:35] <@frozenice> unless someone would write a plugin...
- # [22:35] <@shepazu> frozenice, one thing I'd like it to do is make tidy transcripts from meeting minutes
- # [22:35] <@shepazu> and record actions
- # [22:36] <@frozenice> yeah, I got some ideas for plugins, too
- # [22:36] <@frozenice> the current bot (old code) has some of those
- # [22:38] <@frozenice> the core seems pretty finished for the time being, I'm putting together the actual bot we will use, adding some plugins (like watching for wiki changes), testing the whole plugin system etc.
- # [22:41] <@frozenice> I've been kinda absent, because a colleague / friend suddenly passed away two weeks ago, that sucked very much... but the irc bot is good for getting back to work
- # [22:45] <@shepazu> oh, wow, sorry to hear that!
- # [22:45] <@shepazu> that's terrible.
- # [22:48] <@frozenice> yeah, my job got a bit stressier, but we'll manage, the show must go on :)
- # [22:49] <@frozenice> stressier = more stressful
- # [22:49] <@frozenice> I claim that word
- # [22:51] <@shepazu> can I rent that word from you?
- # [22:51] <@frozenice> only if you don't pay me!
- # [22:56] <@frozenice> well, let's see how far I can get the bot this week.
- # [22:57] <@shepazu> frozenice, ok, I have a list of all the CSS property pages that we want… I can do the same for HTML, SVG, etc.
- # [22:57] <ptressel> Hi, shepazu
- # [22:57] <@shepazu> is that really all we need?
- # [22:57] <@shepazu> hi, ptressel!
- # [22:58] <ptressel> I had windows of time to work on scraping -- another will open up starting today.
- # [22:58] <@shepazu> ptressel, ok, great
- # [22:59] <@shepazu> ptressel, I'm not sure we actually need to scrape, based on what frozenice said… confirming now
- # [22:59] <@shepazu> or rather, we don't need to crawl, sorry
- # [22:59] <ptressel> Ok
- # [23:00] <@frozenice> shepazu: cool!
- # [23:00] <@frozenice> hi ptressel :)
- # [23:00] <@frozenice> yep, a list of pages should be enough
- # [23:01] <ptressel> Ok, I'm totally confused now.
- # [23:01] <@shepazu> well, shucks, I can do that tonight
- # [23:01] <ptressel> Just read the chat backlog.
- # [23:01] <ptressel> The issue was that the tag lists cut off at a fixed number.
- # [23:01] <@frozenice> correct
- # [23:01] <ptressel> So the point of crawling was to get around that.
- # [23:02] <ptressel> Yes, there are node.js crawler libraries.
- # [23:02] <ptressel> We got stuck on nutch for a while.
- # [23:02] <@shepazu> ptressel, yeah, but their topic index pages have the full list of pages we want :) https://developer.mozilla.org/en-US/docs/Web/CSS/Reference
- # [23:02] <ptressel> Yes, those are the seed pages.
- # [23:02] <ptressel> The point of the crawler is that it fetches them.
- # [23:02] <@shepazu> renoirb and I just "scraped" all the URLs for that
- # [23:03] <ptressel> I have the list of seed pages.
- # [23:03] <@shepazu> ptressel, yeah, isn't that what frozenice's script does?
- # [23:03] <@shepazu> maybe I'm confused
- # [23:03] <ptressel> If that's different from what's in the node.js work, then I don't know.
- # [23:03] <ptressel> The *tag* request is different.
- # [23:04] <ptressel> That is a specific MDN query that returns a fixed max number of pages having a particular tag.
- # [23:04] <@frozenice> yeah the getting a page list via the tag-feeds was my way to get us started on some useful pages
- # [23:05] <@frozenice> we can change the importer, so it pulls the page list from elsewhere
- # [23:05] <ptressel> Anyhow, I'm going to a meetup tonight where there is a node.js expert.
- # [23:05] <@frozenice> nice
- # [23:05] <ptressel> But if this is done, then I'll work on something else.
- # [23:06] <ptressel> So...done? or not done?
- # [23:06] <@shepazu> ptressel, frozenice, I want to make sure I'm not confused
- # [23:06] <ptressel> I'm totally confused at the moment.
- # [23:06] <@frozenice> :D
- # [23:06] <@shepazu> yay! me too!
- # [23:06] <ptressel> :P
- # [23:06] <@shepazu> ptressel, don't worry, there's plenty more you could do, if you want :)
- # [23:07] <@shepazu> frozenice, ok, sorry to be pedantic...
- # [23:07] <@shepazu> but just to confirm:
- # [23:07] <ptressel> shepazu has another confusion: We've met at the TTWF event in Seattle. I'm a "she" not a "he".
- # [23:07] <ptressel> :D
- # [23:07] <@frozenice> that has also confused me.
- # [23:07] <ptressel> The Pat is for Patricia
- # [23:07] <@shepazu> gah!!!!!
- # [23:07] <ptressel> :D
- # [23:07] <@frozenice> I always get bad luck on names which can be both :P
- # [23:08] <ptressel> :D
- # [23:08] <@shepazu> ptressel, I have a sister named Pat, that's why I was confused… I understand now that frozenice is a woman, despite the name "David"
- # [23:08] <@frozenice> wat
- # [23:08] <@shepazu> now we're all clear, sorry
- # [23:08] <@frozenice> that would be news to me
- # [23:09] <@frozenice> the confusion seems to be spreading
- # [23:09] <@shepazu> frozenice, ok...
- # [23:09] <@shepazu> 1) the reason we weren't getting all the pages was that we didn't have a complete list of pages to scrape
- # [23:10] <@shepazu> 2) if we have a full list of content pages, we can extract the compat tables from each of them with your script
- # [23:10] <@shepazu> 3) ptressel has the complete list of pages we want
- # [23:10] <ptressel> What's the script?
- # [23:10] <ptressel> No, getting the pages is what the crawl is for.
- # [23:11] <@shepazu> 4) frozenice is male, ptressel is female, shepazu is male and confused
- # [23:11] <@shepazu> the script is the nodejs thingie
- # [23:11] <ptressel> What I have are the seed pages -- those are the tables of contents you mentioned, plus a few obscure ones.
- # [23:11] <ptressel> Ok
- # [23:11] <@shepazu> ptressel, I think the seed pages contain all the URLs we want
- # [23:11] <ptressel> The nice thing about a real crawler is that it doesn't annoy their sysadmins.
- # [23:12] * Quits: mattweb_de (~mattweb_d@cable-78-34-4-198.netcologne.de) (Quit: mattweb_de)
- # [23:12] <@frozenice> I have proof for 4) https://www.flickr.com/photos/szene/8459312560/in/set-72157632724112919 directly under the 'W'
- # [23:12] <ptressel> It obeys robots.txt, doesn't fetch too rapidly, etc.
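The two crawler courtesies ptressel mentions (obeying robots.txt and not fetching too rapidly) can be illustrated with a minimal sketch. It uses Python's standard `urllib.robotparser` and does no network access. The robots.txt content, the `example-bot` agent name, the example.org URLs, and the helper names `make_parser` and `polite_fetch_order` are all assumptions made for illustration; none of them come from MDN or from the importer discussed in the channel.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only (not MDN's real file).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def make_parser(robots_txt):
    """Build a robots.txt parser from text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_fetch_order(urls, parser, agent="example-bot",
                       default_delay=1.0, sleep=time.sleep):
    """Yield only the URLs robots.txt allows, pausing between consecutive ones."""
    # Use the site's Crawl-delay when it declares one, else a fallback delay.
    delay = parser.crawl_delay(agent) or default_delay
    first = True
    for url in urls:
        if not parser.can_fetch(agent, url):
            continue  # disallowed for this agent: skip it
        if not first:
            sleep(delay)  # throttle so requests are spaced out
        first = False
        yield url
```

A real crawler would fetch each yielded URL at that point; the `sleep` parameter is injectable only so the pacing can be checked without actually waiting.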
- # [23:12] <@shepazu> ptressel, you want to send me your list of seed pages?
- # [23:12] <ptressel> Let me dig them out...
- # [23:13] <@shepazu> ptressel, that just proves you have long hair, I've had long hair!
- # [23:13] <@shepazu> heck, look at the bearded weirdo next to you, he has longer hair than you!
- # [23:14] <@frozenice> wtf are you talking about shepazu :D
- # [23:14] <@shepazu> ptressel, sorry I didn't remember you, I'm bad with names
- # [23:14] <@shepazu> frozenice, I think I might not be sure anymore
- # [23:15] <@frozenice> you know that's jswisher in the center of that photo, right?
- # [23:15] <@shepazu> yes, and Chris Mills to the right
- # [23:16] <@frozenice> yes
- # [23:16] <@shepazu> and ptressel to the left, IIUI
- # [23:16] <@frozenice> no
- # [23:16] <@shepazu> with her back turned
- # [23:16] <@shepazu> oh… she said "under the W"
- # [23:16] <@frozenice> I SAID THIS :D
- # [23:16] <ptressel> :D
- # [23:16] <@shepazu> wtf????
- # [23:17] <@shepazu> I am going blind and insane
- # [23:17] <ptressel> "He said this"
- # [23:17] <@frozenice> on Janet's right is Flo (also from MDN, with orange lanyard) and to his right it's me, under the 'W'
- # [23:17] <@shepazu> I think I might need to stop drinking so much petroleum
- # [23:17] <@frozenice> or drink more
- # [23:17] <@shepazu> at least on work days
- # [23:18] <ptressel> Seed pages: http://pastebin.ubuntu.com/7499120/
- # [23:18] <@shepazu> frozenice, you are 2 down from Janet?
- # [23:18] <@frozenice> yeah, with orange lanyard
- # [23:18] <ptressel> The short list at the top is good enough for a depth 2 or 3 crawl.
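What a "depth 2 or 3 crawl" from a short seed list means can be sketched as a breadth-first walk that stops following links after a fixed number of hops. This is only an illustration under stated assumptions: `crawl_to_depth` and `get_links` are hypothetical names, and link extraction is passed in as a function so the sketch stays independent of any particular HTTP or HTML library.

```python
from collections import deque


def crawl_to_depth(seeds, get_links, max_depth):
    """Return pages reachable from the seeds within max_depth link hops.

    get_links(url) is a caller-supplied function returning the URLs that a
    page links to. Seeds count as depth 0.
    """
    order = list(dict.fromkeys(seeds))  # seeds, de-duplicated, order kept
    seen = set(order)
    queue = deque((url, 0) for url in order)
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # deep enough: keep the page but do not follow its links
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                order.append(link)
                queue.append((link, depth + 1))
    return order
```

With table-of-contents pages as seeds, depth 1 would cover the pages those tables link to directly, and each extra level of depth picks up pages that are only linked from other content pages.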
- # [23:18] <@frozenice> in between me and Janet is Florian Scholz
- # [23:19] <@shepazu> you look male, true, and I'm willing to take your word for it… but I don't consider that photo proof, you're covering your face
- # [23:19] <@frozenice> it's a fact and that should clear up article 4) subsection 1. :)
- # [23:19] <@shepazu> but I'm not judgmental, you can be whatever sex you want
- # [23:19] * Quits: auchenberg (~auchenber@x1-6-00-8e-f2-36-28-8a.cpe.webspeed.dk) (Remote host closed the connection)
- # [23:19] <ptressel> I'm not in that pic.
- # [23:20] <ptressel> :D
- # [23:20] <ptressel> So there's no evidence there re my gender. ;-)
- # [23:20] <@frozenice> the girl in purple is User:Vivienne, IIRC
- # [23:21] <@frozenice> uhm, what I was actually wanting to say
- # [23:21] <@frozenice> the CSS/Reference page is maybe a good start, but as ptressel said there are some weird pages, which could be discovered through crawling
- # [23:22] <@shepazu> frozenice, would this be a reasonable start?
- # [23:22] <@shepazu> http://pastebin.ubuntu.com/7499135/
- # [23:22] <@frozenice> spotted 2 external links
- # [23:23] <@shepazu> frozenice, yeah, minus those and a few others
- # [23:23] <@frozenice> it's good for a start, yeah
- # [23:23] <@frozenice> I wonder how many of those <500 we already have :)
- # [23:23] <@shepazu> frozenice, and who knows how to run your script?
- # [23:23] <ptressel> Some of those urls are dups with fragments
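The duplicates ptressel points out arise when several links differ only in their `#fragment`, so they all name the same page. A minimal sketch of collapsing them, using the standard `urllib.parse.urldefrag`, follows. The function name `dedupe_without_fragments` is made up for illustration.

```python
from urllib.parse import urldefrag


def dedupe_without_fragments(urls):
    """Drop #fragments, then keep the first occurrence of each page URL."""
    seen = set()
    unique = []
    for url in urls:
        base, _fragment = urldefrag(url.strip())
        if base and base not in seen:
            seen.add(base)
            unique.append(base)
    return unique
```

The original order is preserved, which keeps the cleaned list easy to compare against the pasted one by eye.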
- # [23:23] <@frozenice> there was a thread in the ML, I believe
- # [23:25] <@frozenice> shepazu: http://lists.w3.org/Archives/Public/public-webplatform/2014Jan/0030.html
- # [23:26] <@frozenice> the README in https://github.com/webplatform/mdn-compat-importer has some more instructions
- # [23:27] <ptressel> There are changes since the last version I pulled, I think.
- # [23:28] <@shepazu> ok
- # [23:28] <@frozenice> renoirb has done some work on the conversion and some meta-stuff
- # [23:28] <@shepazu> yeah
- # [23:28] <@shepazu> ok, here's my plan
- # [23:28] <@shepazu> I'm going to find all the pages we want (or at least most of them)
- # [23:29] <@shepazu> using ptressel's seed pages to inform that list
- # [23:29] <@shepazu> I'll make a master list
- # [23:29] <ptressel> You're going to crawl by hand? :D
- # [23:29] <@shepazu> then compare those pages to the ones we already got results for
- # [23:29] * Quits: codylindley (~textual@184-155-250-216.cpe.cableone.net) (Quit: ["Textual IRC Client: www.textualapp.com"])
- # [23:29] <@shepazu> and remove the dupes
- # [23:30] <@shepazu> ptressel, it will take me less time to do it by hand than to write a script for it and execute it
- # [23:31] <@shepazu> frozenice, once I have that list, we'll run it against MDN
- # [23:31] <@shepazu> and convert the results
- # [23:31] <@shepazu> then whammo, we're done
- # [23:31] <@frozenice> we will feed that list to the importer, yes
- # [23:31] <@shepazu> we only need to do this once
- # [23:31] <@shepazu> we don't need a repeatable process
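shepazu intends to do the comparison step by hand, but for reference the "compare the master list to the pages we already have results for, and remove the dupes" step amounts to an order-preserving set difference. A minimal sketch, where `pages_still_to_scrape` and both argument names are hypothetical:

```python
def pages_still_to_scrape(master_list, already_scraped):
    """Return master-list URLs with no results yet, each listed only once."""
    done = set(already_scraped)
    remaining = []
    for url in master_list:
        if url not in done:
            done.add(url)  # also drops repeats within the master list itself
            remaining.append(url)
    return remaining
```

The result is the list that would be handed to the importer; URLs are compared as exact strings, so any normalization has to happen beforehand.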
- # [23:33] <ptressel> Don't we need to repeat this at wossname?
- # [23:33] <ptressel> Other site...
- # [23:33] <@shepazu> ptressel, quirksmode? caniuse?
- # [23:34] <@shepazu> caniuse.com already has a json feed of its results available
- # [23:34] <ptressel> Ah, right, caniuse
- # [23:35] <@shepazu> ptressel, so, we don't need to scrape it
- # [23:36] <@frozenice> I think we only need to get rid of https://github.com/webplatform/mdn-compat-importer/blob/master/index.js#L25 and put the master list into reader.links instead
- # [23:38] <@shepazu> OK
- # [23:38] <@frozenice> whoever coded that thing did a fine job of separating the tasks :D
- # [23:41] <ptressel> :D
- # [23:43] <ptressel> shepazu, That list is for CSS. What about others?
- # [23:44] <@shepazu> ptressel, I've gathered HTML attributes and elements so far, as well
- # [23:44] <@shepazu> working on the others
- # [23:44] <@shepazu> frozenice, if you do say so yourself?
- # [23:45] <@frozenice> well, I recognize good code when I see it!
- # [23:45] <ptressel> :D
- # [23:45] <@frozenice> (is "it" right here? sounds kinda wrong)
- # [23:46] <@shepazu> frozenice, yup, "it" is correct
- # [23:46] <ptressel> Heading toward gender confusion again? :D
- # [23:46] <@frozenice> :P
- # [23:46] <@shepazu> das ist in ordenung [German, misspelled: "that's all right"]
- # [23:47] <@frozenice> hehe, almost perfect ("ordnung")
- # [23:47] <@shepazu> gah!
- # [23:48] <@shepazu> ich habe für drei jahre deutsch gelernt, aber ich has alles vergessen [German, with a slip: "I studied German for three years, but I has forgotten everything"]
- # [23:49] <@frozenice> not bad ("habe") :)
- # [23:49] <@shepazu> oh, yeah
- # [23:49] <@frozenice> denglish
- # [23:49] <@shepazu> doch [German: "yes, indeed"]
- # [23:49] <@shepazu> or should I say, do'ch!
- # [23:50] <@frozenice> :D
- # [23:50] * Joins: auchenberg (~auchenber@x1-6-00-8e-f2-36-28-8a.cpe.webspeed.dk)
- # [23:52] <ptressel> Ok, so just to be clear... I don't need to do anything else? I should not add the crawler module to frozenice's code?
- # [23:52] * Quits: auchenberg (~auchenber@x1-6-00-8e-f2-36-28-8a.cpe.webspeed.dk) (Read error: Connection reset by peer)
- # [23:52] * Joins: auchenberg (~auchenber@x1-6-00-8e-f2-36-28-8a.cpe.webspeed.dk)
- # [23:53] <@frozenice> I'd say if you want to do a crawler, do it as a separate project, so you are free in choice of modules etc., if it spits out a list of pages, we could use that
- # [23:55] <ptressel> Generally the crawler fetches the pages too.
- # [23:55] <@frozenice> hm indeed, maybe if it also parses out the compatibility HTML
- # [23:55] <ptressel> It's not a matter of "want"...I'm asking what needs to be done.
- # [23:56] * Joins: auchenbe_ (~auchenber@x1-6-00-8e-f2-36-28-8a.cpe.webspeed.dk)
- # [23:56] <ptressel> Typical crawlers hand off to an indexer for parsing, except for extracting links to follow.
- # [23:56] <ptressel> E.g. nutch hands off to lucene and solr for indexing and serving
- # [23:57] * Quits: auchenberg (~auchenber@x1-6-00-8e-f2-36-28-8a.cpe.webspeed.dk) (Ping timeout: 255 seconds)
- # Session Close: Thu May 22 00:00:00 2014