Deep data

Hi,

here’s a report on my quest to harvest data from multi-page websites. Handling a single page is trivial with tools like Dapper or YQL. It gets tricky when the data is spread over separate web pages, as on this one:

http://www.kerknet.be/zoek_parochie.php?allbisdom=1

The complete list of parishes is split over 7 separate pages, and I’d like to have them all together.

Before diving into the series of attempts, let’s first distinguish two typical situations:
1) split data: the data is split over multiple pages and the pages are sequentially linked together
2) deep data: the main page only contains links to the pages where the actual data is available

The above example is actually a combination of both, with the special twist that each data page contains a single record with several data fields.

Deep data

First attempt: YQL (failed)

This YQL query fetches the links on a single page:

select href from html where url = "http://www.kerknet.be/zoek_parochie.php?allbisdom=1" and xpath = "//table[@class='parochies']/tr/td[@class='col3']/a"

Results are like this:

<results>
<a href="/parochie/parochie_fiche.php?parochieID=24"/>
<a href="/parochie/parochie_fiche.php?parochieID=21"/>
<a href="/parochie/parochie_fiche.php?parochieID=22"/>

</results>

I thought the result of this query could serve as a subquery, feeding the list of pages to the actual data query through the “url in ()” statement, but I overlooked the fact that the retrieved URLs are relative.
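To make the relative-URL problem concrete, here is a minimal Python sketch (not YQL) of what the subquery was missing: every href has to be resolved against the page it was found on. The SAMPLE markup is a hypothetical, well-formed stand-in for the listing page, and absolute_links is a name made up for this illustration.

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

LIST_PAGE = "http://www.kerknet.be/zoek_parochie.php?allbisdom=1"

# Hypothetical, well-formed stand-in for the real listing page markup.
SAMPLE = """
<html><body>
<table class="parochies">
  <tr><td class="col3"><a href="/parochie/parochie_fiche.php?parochieID=24"/></td></tr>
  <tr><td class="col3"><a href="/parochie/parochie_fiche.php?parochieID=21"/></td></tr>
</table>
</body></html>
"""


def absolute_links(markup, base_url):
    """Collect the col3 hrefs and resolve each one against base_url."""
    root = ET.fromstring(markup.strip())
    # ElementTree's limited XPath is enough for the expression used in the query.
    anchors = root.findall(".//table[@class='parochies']/tr/td[@class='col3']/a")
    return [urljoin(base_url, a.get("href")) for a in anchors]


links = absolute_links(SAMPLE, LIST_PAGE)
```

With absolute URLs like these, a “url in ()” style lookup has something it can actually fetch.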

So this YQL query doesn’t work at all:

select * from html where xpath = "//table[@class='contact first']" and url in (select href from html where url = "http://www.kerknet.be/zoek_parochie.php?allbisdom=1" and xpath = "//table[@class='parochies']/tr/td[@class='col3']/a")

Second attempt: Dapper + YQL (success on deep data issue)

Next I created a Dapper to extract the links from the overview page. Luckily, Dapper returns full paths, because it’s designed to render RSS feeds. The Dapper is configured to return XML data items like this:

<item dataType="RawString" fieldName="item" href="http://www.kerknet.be/parochie/parochie_fiche.php?parochieID=65" originalElement="a" type="field">SINT-ANNA</item>

The output of the Dapper can now be queried by YQL:

select item.href from xml where url = "http://www.dapper.net/RunDapp?dappName=KerknetParochiesdeephtml&v=1&applyToUrl=http%3A%2F%2Fwww.kerknet.be%2Fzoek_parochie.php%3Fallbisdom%3D1"

… and submitted to the “url in ()” statement:

select * from html where xpath = "//table[@class='contact first']" and url in (select item.href from xml where url = "http://www.dapper.net/RunDapp?dappName=KerknetParochiesdeephtml&v=1&applyToUrl=http%3A%2F%2Fwww.kerknet.be%2Fzoek_parochie.php%3Fallbisdom%3D1")

This works, but it doesn’t address the split data on the ‘next’ list pages.

Third attempt: Yahoo Pipes + Dapper (success on split data issue)

YQL cannot solve the split data issue without a dedicated table and some JavaScript coding. Such a table might have an interface like this:

select * from deepdata
where url = "http://www.kerknet.be/zoek_parochie.php?allbisdom=1"
and urlxpath = "//table[@class='parochies']/tr/td[@class='col3']/a/@href"
and nextxpath = "//span[@class='pagelinks']/a[.='Volgende']/@href"
and xpath = "//table[@class='contact first']"

Where:

url is the starting listing page
urlxpath is the xpath for fetching links to the datapages from the listing pages
nextxpath is the xpath for fetching the link to the next listing pages
xpath is the xpath for extracting the data from the datapages

But as proof of concept, my next attempt involves Yahoo Pipes.

Conceptually, for retrieving each next page, a recursive setup would be ideal. A pipe cannot be called recursively out of the box, but I found the answer here:

Note that, in Yahoo! Pipes, recursion is performed by calling the pipe.run method of the <pipes.yahoo.com> service with the unique ID of the Pipe, parameters, and desired output format (e.g., JSON). Attempting to include a Pipe as a subpipe to itself directly may work while using the Pipes Editor, but does not when the Pipe is run outside the editor.

And after (quite) a while, here’s my pipe. It returns the Dapper result for the provided page plus the Dapper results for the next pages, by calling itself recursively. Note that it doesn’t do the actual deep data retrieval yet!

http://pipes.yahoo.com/pipes/pipe.edit?_id=3f5c8fc808ae3f27b3d6a5b578aa0da0

These are its input fields:

ListPage url for the list page
ListDapper dapper for parsing the list page
NextDapper dapper for retrieving the next page link

An example of running the pipe for rendering as rss:

http://pipes.yahoo.com/pipes/pipe.run?ListDapper=KerknetParochiesdeephtml&ListPage=http%3A%2F%2Fwww.kerknet.be%2Fzoek_parochie.php%3Fallbisdom%3D1&NextDapper=KerknetParochiesdeephtmlnext&_id=3f5c8fc808ae3f27b3d6a5b578aa0da0&_render=rss

Note that the pipe easily takes up to 2 minutes to return! That’s for accessing 7 pages and returning a feed of ~300 items. What to expect when, in the end, ~300 pages must be downloaded and analyzed!

Some notes that I made during debugging:

  • for debugging, the pipe description page gives better debug info than the development screen!
  • pipes are meant for processing RSS data; generic XML data is bound to give errors.

Fourth attempt: Yahoo Pipes + Dapper + YQL (failed)

Combining attempts 2 and 3 gives this:

select * from html where xpath = "//table[@class='contact first']" and url in (select link from feed where url = "http://pipes.yahoo.com/pipes/pipe.run?ListDapper=KerknetParochiesdeephtml&ListPage=http%3A%2F%2Fwww.kerknet.be%2Fzoek_parochie.php%3Fallbisdom%3D1&NextDapper=KerknetParochiesdeephtmlnext&_id=3f5c8fc808ae3f27b3d6a5b578aa0da0&_render=rss")

The main downside is that YQL returns messy extracts from (equally messy) HTML code. So either a dedicated YQL table containing JavaScript code should be devised, or Dapper is called to the rescue again. Prepare for really bad performance this time!

Pipes can’t manage the Dapper output… it has a hard time handling XML data outside an RSS context.

Fifth attempt: Yahoo Pipes + Dapper + YQL (success, but…)

Attacking the problem the other way around, the output of the messy YQL query can be submitted to dapper for extracting the data in a proper format.

The REST URL becomes unreadable, but for the sake of reference, here it is:

http://www.dapper.net/RunDapp?dappName=KerknetParochiespipe&v=1&applyToUrl=http%3A%2F%2Fquery.yahooapis.com%2Fv1%2Fpublic%2Fyql%3Fq%3Dselect%2520*%2520from%2520html%2520where%2520xpath%2520%3D%2520%2522%2F%2Ftable%255b%40class%3D%27contact%2520first%27%255d%2522%2520and%2520url%2520in%2520%28select%2520link%2520from%2520feed%2520where%2520url%2520%3D%2520%2522http%3A%2F%2Fpipes.yahoo.com%2Fpipes%2Fpipe.run%3FListDapper%3DKerknetParochiesdeephtml%26ListPage%3Dhttp%253A%252F%252Fwww.kerknet.be%252Fzoek_parochie.php%253Fallbisdom%253D1%26NextDapper%3DKerknetParochiesdeephtmlnext%26_id%3D3f5c8fc808ae3f27b3d6a5b578aa0da0%26_render%3Drss%2522%29%26diagnostics%3Dtrue
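The URL above is unreadable because each service wraps the previous service’s URL as a parameter, and every wrapping adds another round of percent-encoding. A minimal Python sketch of that nesting, using shortened URLs with only a few of the parameters (so it does not reproduce the full URL above):

```python
from urllib.parse import quote, unquote

list_page = "http://www.kerknet.be/zoek_parochie.php?allbisdom=1"

# Level 1: the list page becomes a parameter of the pipe URL (":" turns into "%3A").
pipe_url = ("http://pipes.yahoo.com/pipes/pipe.run?_render=rss&ListPage="
            + quote(list_page, safe=""))

# Level 2: the pipe URL becomes Dapper's applyToUrl parameter,
# so the "%" of level 1 is itself escaped ("%3A" turns into "%253A").
dapper_url = ("http://www.dapper.net/RunDapp?v=1&applyToUrl="
              + quote(pipe_url, safe=""))

# Unwrapping goes the other way: one unquote per level.
recovered_pipe_url = unquote(dapper_url.split("applyToUrl=", 1)[1])
recovered_list_page = unquote(recovered_pipe_url.split("ListPage=", 1)[1])
```

Building the URL layer by layer like this, instead of editing the final string by hand, at least takes typos in the escaping out of the debugging equation.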

The main problem for debugging is that all these services cache aggressively and, on top of that, Dapper is quite unreliable. So every once in a while, either you’ve made a typo or Dapper is refusing service, and somewhere in the chain you end up with faulty data, which is cached and reappears on subsequent calls.

Also, performance of this setup is really bad. It could be used for offline data-retrieval, but not for interactive systems.

Sixth attempt: Same ingredients in different order

So I’ll use

  1. Yahoo Pipes because it has a clear interface to the end user; my only fear is still that it won’t output custom XML formats
  2. YQL because it lets me program part of the logic and because it performs very well
  3. Dapper because it has an excellent combination of logical data-extraction power and user-friendliness

The pipe will pick up the parameters and call the YQL query. The query calls the dappers and collects the data.

Parameters:

scenario             datadapper    listdapper    nextdapper    url
                     (mandatory)   (optional)    (optional)    (mandatory)
single page          X                                         X
“split data”         X                           X             X
“deep data”          X             X                           X
“split deep data”    X             X             X             X
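The parameter combinations above boil down to a small decision rule. A minimal Python sketch, assuming that listdapper signals “deep” data and nextdapper signals “split” data, as in the definitions at the top of this post (the function name scenario is made up for this illustration):

```python
def scenario(datadapper, url, listdapper=None, nextdapper=None):
    """Name the scenario implied by which optional dappers are supplied."""
    if not datadapper or not url:
        raise ValueError("datadapper and url are mandatory")
    if listdapper and nextdapper:
        return "split deep data"
    if listdapper:
        return "deep data"
    if nextdapper:
        return "split data"
    return "single page"


kerknet = scenario(
    "KerknetParochiesdata",
    "http://www.kerknet.be/zoek_parochie.php?allbisdom=1",
    listdapper="kerknetparochieslist",
    nextdapper="KerknetParochiesdeephtmlnext",
)
```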

This is the “split deep data” scenario query logic in pseudo-code:

data = empty
list = yql(select * from xml where url="http://www.dapper.net/dapper=listdapper&url=url")
foreach dataurl in list {
    newdata = yql(select * from xml where url="http://www.dapper.net/dapper=datadapper&url=dataurl")
    data = concatenate(data, newdata)
}
nexturl = yql(select * from xml where url="http://www.dapper.net/dapper=nextdapper&url=url")
if nexturl {
    nextdata = yql(select * from splitdeepdata where datadapper=datadapper and listdapper=listdapper and nextdapper=nextdapper and url=nexturl)
    data = concatenate(data, nextdata)
}
return data
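The same recursion, as a minimal runnable Python sketch. The network calls are replaced by two hypothetical in-memory dictionaries (LISTINGS and RECORDS), so this only illustrates the control flow, not the actual Dapper or YQL calls.

```python
# Hypothetical stand-in for the site:
# listing page -> (links to data pages, next listing page or None)
LISTINGS = {
    "page1": (["fiche24", "fiche21"], "page2"),
    "page2": (["fiche65"], None),
}
# Hypothetical stand-in for what the data dapper would return per data page.
RECORDS = {
    "fiche24": "O.-L.-VROUW TEN HEMEL OPGENOMEN",
    "fiche21": "SINT-ANDRIES",
    "fiche65": "SINT-ANNA",
}


def split_deep_data(url):
    """Collect the records behind one listing page, then recurse into the next."""
    links, next_url = LISTINGS[url]
    data = [RECORDS[link] for link in links]
    if next_url:
        data += split_deep_data(next_url)
    return data


parishes = split_deep_data("page1")
```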

Listdapper

select href, content from xml where url="http://www.dapper.net/RunDapp?dappName=kerknetparochieslist&v=1&applyToUrl=http%3A%2F%2Fwww.kerknet.be%2Fzoek_parochie.php%3Fallbisdom%3D1" and itemPath="//item"

returns

<results>
<item href="http://www.kerknet.be/parochie/parochie_fiche.php?parochieID=24">O.-L.-VROUW TEN HEMEL OPGENOMEN</item>
<item href="http://www.kerknet.be/parochie/parochie_fiche.php?parochieID=21">SINT-ANDRIES</item>

</results>

note: the dapper must be configured to return hyperlinks to the datapages as ‘item’ elements

Datadapper

select * from xml where itemPath="//item" and url="http://www.dapper.net/RunDapp?dappName=KerknetParochiesdata&applyToUrl=http%3A%2F%2Fwww.kerknet.be%2Fparochie%2Fparochie_fiche.php%3FparochieID%3D24&v=1"

returns

<results>
<item groupName="item" type="group">
<parochie dataType="RawString" fieldName="parochie" originalElement="th" type="field">O.-L.-VROUW TEN HEMEL OPGENOMEN, ANTWERPEN</parochie>
<heiligemis dataType="RawString" fieldName="heiligemis" originalElement="th" type="field">Za</heiligemis>
<heiligemis dataType="RawString" fieldName="heiligemis" originalElement="td" type="field">16.00u Eucharistieviering 17.00u Eucharistieviering</heiligemis>
<heiligemis dataType="RawString" fieldName="heiligemis" originalElement="th" type="field">Zo</heiligemis>
<heiligemis dataType="RawString" fieldName="heiligemis" originalElement="td" type="field">09.00u Eucharistieviering 10.30u Eucharistieviering 12.00u Eucharistieviering 17.00u Eucharistieviering</heiligemis>
<email dataType="RawString" fieldName="email" href="http://www.kerknet.be/parochie/mailto.php?parochieID=24&amp;mailto=92" originalElement="a" type="field">E-mail</email>
<adres1 dataType="RawString" fieldName="adres1" originalElement="span" type="field">Sint-Pieterstraat 1</adres1>
<adres2 dataType="RawString" fieldName="adres2" originalElement="span" type="field">2000 Antwerpen</adres2>
<telefoon dataType="RawString" fieldName="telefoon" originalElement="span" type="field">Tel. 03/213.99.60</telefoon>
</item>
</results>

note: the dapper must be configured to group elements in groups named ‘item’!

Trying it out (see the open table logic on git):

use 'http://github.com/vicmortelmans/yql-tables/raw/master/data/deepdapper.xml' as deepdapper;
select * from deepdapper where url = "http://www.kerknet.be/zoek_parochie.php?allbisdom=1"
and datadapper = "KerknetParochiesdata"
and nextdapper = "KerknetParochiesdeephtmlnext"
and listdapper = "kerknetparochieslist"

Damn! Already bumping into YQL’s execution rate limits:

The following rate limits apply to executions within Open Data Tables:

Item                        Limit
Total Time for Execution    30 seconds

Hey! it shouldn’t be giving an error sounding like this:

Exception: Circular table reference detected while using ‘deepdapper’. This table was already used in the call stack deepdapper

So, can’t I do recursion in YQL? – No, I can’t. Drawing my conclusions:

My attempt was to scrape data from paged html content. The table would scrape the content from the first page (the one provided in the where clause), then go looking for a ‘next page’ hyperlink and call itself again providing the next page url in the where clause, and return the collected data from the first page bundled with the data collected by the recursive call.

I guess I may have to look into the Open Tables paging functionality, or rewrite the algorithm as a loop.

Maybe the pagination feature would bring a fix? But I doubt it.

Seventh attempt: rewriting the open table as a loop

data = empty
next = url
while next {
    list = yql(select * from xml where url="http://www.dapper.net/dapper=listdapper&url=next")
    foreach dataurl in list {
        newdata = yql(select * from xml where url="http://www.dapper.net/dapper=datadapper&url=dataurl")
        data = concatenate(data, newdata)
    }
    next = yql(select * from xml where url="http://www.dapper.net/dapper=nextdapper&url=next")
}
return data
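For comparison, a minimal Python sketch of the same loop. The three helper functions are hypothetical stand-ins for the listdapper, nextdapper and datadapper calls and read from in-memory dictionaries instead of the network. The seen set is my own addition, not part of the pseudo-code: it stops the loop if a ‘next’ link ever points back to a page that was already visited.

```python
def crawl(start_url, list_links, next_link, fetch_record):
    """Walk the chain of listing pages iteratively, collecting every record."""
    data, seen, url = [], set(), start_url
    while url and url not in seen:
        seen.add(url)
        for data_url in list_links(url):
            data.append(fetch_record(data_url))
        url = next_link(url)
    return data


# Hypothetical stand-in for the site:
# listing page -> (links to data pages, next listing page or None)
LISTINGS = {
    "page1": (["fiche24", "fiche21"], "page2"),
    "page2": (["fiche65"], None),
}
RECORDS = {
    "fiche24": "O.-L.-VROUW TEN HEMEL OPGENOMEN",
    "fiche21": "SINT-ANDRIES",
    "fiche65": "SINT-ANNA",
}

parishes = crawl(
    "page1",
    list_links=lambda url: LISTINGS[url][0],
    next_link=lambda url: LISTINGS[url][1],
    fetch_record=lambda url: RECORDS[url],
)
```

Passing the three steps in as functions keeps the loop itself independent of how the pages are fetched.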

Still debugging, but I read something interesting: “Queries called from y.query return and execute instantly. However, data is only returned when the results property is accessed. This feature allows you to make multiple, independent queries simultaneously that are then allowed to process before being returned together when the results property is accessed.” So now I’m collecting all data queries in an array and only reading out the results at the end… or maybe this isn’t so good at all, because now Dapper will be hit with all the queries almost instantaneously.
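The “fire all queries first, read the results later” pattern has a rough analogue in Python’s concurrent.futures, sketched here. This is an analogy, not how y.query works internally. fetch_record is a hypothetical stand-in for one data query, and max_workers is my own addition: it caps how many requests run at the same time, which is one way to avoid hammering the service.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_record(url):
    """Hypothetical stand-in for one data query; a real one would hit the network."""
    return {"url": url}


def fetch_all(urls, max_workers=4):
    """Submit every query up front, then read the results in submission order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Nothing blocks here: each submit returns a future immediately.
        futures = [pool.submit(fetch_record, url) for url in urls]
        # Blocking only happens when a result is actually read.
        return [future.result() for future in futures]


records = fetch_all(["fiche24", "fiche21", "fiche65"], max_workers=2)
```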

More debugging notes:

  • When I concatenate two XML lists, the result seems to become a string!? The concatenation operator ‘+’ or ‘+=’ doesn’t work properly. Using the ‘appendChild’ method is safer!
  • Submitting an XML List as response.object also doesn’t seem to be a good idea. It likes genuine XML objects better!
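The same idea, appending nodes to one real XML object instead of concatenating, looks like this in a minimal Python sketch with ElementTree. It is not the E4X code from the open table; merge_items and the two sample documents are made up for this illustration.

```python
import xml.etree.ElementTree as ET


def merge_items(*documents):
    """Move the <item> children of several result documents under one root."""
    merged = ET.Element("results")
    for document in documents:
        merged.extend(ET.fromstring(document).findall("item"))
    return merged


# Hypothetical result documents for two listing pages.
page1 = '<results><item href="a">SINT-ANNA</item></results>'
page2 = '<results><item href="b">SINT-ANDRIES</item></results>'

merged = merge_items(page1, page2)
merged_xml = ET.tostring(merged, encoding="unicode")
```

The result stays a single element tree throughout, so there is no point at which it can silently degrade into a string.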

Success!!

Now also making nextdapper and listdapper optional, and adding nextcount to limit the number of pages… and trying to find a better example than Kerknet for the documentation.

OK, this is ready for release. Documented and all!

Appendix A – filtering results of YQL xml table

In the itemPath parameter (not called ‘xpath’, as in the html table!), you can filter elements from the raw xml. Filtering down to attribute level won’t do anything.

In the select statement, you can filter elements down the result tree. Use ‘.’-separated paths. Don’t mention the root element! You can also address attributes (without ‘@’!). Use ‘content’ to refer to the content of an element. List multiple path specifications to cover more data. The structure of the result tree remains the same (this means that attributes remain attributes!), but it’s limited to the elements that match the specified path(s), their parents and children.
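As a rough analogy in Python (not YQL): itemPath picks the elements, and the select list picks attributes or ‘content’ from each of them. This simplified sketch only handles attributes and content of the matched element itself, not nested ‘.’-separated paths. The <dapper> wrapper element is a hypothetical root, and select is a name made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw XML; only the <item> elements mirror the Dapper output shown earlier.
RAW = (
    "<dapper>"
    '<item href="http://www.kerknet.be/parochie/parochie_fiche.php?parochieID=24">'
    "O.-L.-VROUW TEN HEMEL OPGENOMEN</item>"
    '<item href="http://www.kerknet.be/parochie/parochie_fiche.php?parochieID=21">'
    "SINT-ANDRIES</item>"
    "</dapper>"
)


def select(raw_xml, item_path, fields):
    """Return one dict per matched element; 'content' means the element text."""
    rows = []
    for element in ET.fromstring(raw_xml).findall(item_path):
        row = {}
        for field in fields:
            row[field] = element.text if field == "content" else element.get(field)
        rows.append(row)
    return rows


# ".//item" is ElementTree's spelling of the "//item" itemPath.
rows = select(RAW, ".//item", ["href", "content"])
```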

Appendix B – small Git reminder notes

git add bible.xml
git commit -m "adding open table containing bible passages"
git push origin master

# after editing online
git pull

Appendix C – debugging YQL tables

Yahoo! http://developer.yahoo.com/yql/console/?debug=true
