Dapper clone

Using SelectorGadget and YQL community open table data.html.cssselect

It’s fairly simple to mimic (some of) the behaviour of Dapper using these two tools. I looked into this because I bumped into a Dapper disadvantage (in most cases it’s an advantage, but not in my particular case): it reduces the captured HTML fragments to plain text. That is fine for a simple RSS update, but bad if you want to retain the formatting of the captured text (e.g. italics may carry meaning that matters for the content).

Procedure:

  1. Install the SelectorGadget in your browser (it’s a bookmarklet, so you can simply drag it into your bookmarks)
  2. Open the webpage you want to retrieve data from (or at least a very similar webpage)
  3. Start the SelectorGadget
  4. Play around until you’ve selected the right sections
  5. Copy the displayed CSS selector
  6. Open the YQL console
    You’ll see a sample query: select * from data.html.cssselect where url="http://www.doorstroming.net/index.php/actua/49-qnationalisme-is-nationalismeq.html" and css=".MsoNormal"
  7. Replace the value for ‘css’ with the CSS selector generated by SelectorGadget
  8. Replace the value for ‘url’ with the URL of the page containing your data. This can be any page with a similar structure to the page you’ve used originally.
  9. Execute the query or copy the REST URL

The REST URL looks like this:
http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20data.html.cssselect%20where%20url%3D%22http%3A%2F%2Fwww.doorstroming.net%2Findex.php%2Factua%2F49-qnationalisme-is-nationalismeq.html%22%20and%20css%3D%22.MsoNormal%22&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys
By replacing the values of url and css in this URL with, respectively, the address of the website and the applicable CSS selector (both URL-encoded!), you can apply the query to any webpage.
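
The substitution described above can be sketched in a few lines of Python. This is only an illustration: the function name build_yql_rest_url is made up for this example, and the endpoint and env values are simply copied from the REST URL shown above. It only builds the URL and does not fetch anything.

```python
from urllib.parse import quote

# Both values are copied from the sample REST URL in this post.
YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql"
COMMUNITY_ENV = "store://datatables.org/alltableswithkeys"


def build_yql_rest_url(page_url, css_selector):
    """Hypothetical helper: build the REST URL for a data.html.cssselect query."""
    # The YQL statement, exactly as typed in the console.
    query = (
        "select * from data.html.cssselect "
        f'where url="{page_url}" and css="{css_selector}"'
    )
    # URL-encode the whole statement; '*' is left as-is so the result
    # matches the URL the console produces.
    return (
        f"{YQL_ENDPOINT}?q={quote(query, safe='*')}"
        f"&env={quote(COMMUNITY_ENV, safe='')}"
    )


if __name__ == "__main__":
    print(build_yql_rest_url(
        "http://www.doorstroming.net/index.php/actua/49-qnationalisme-is-nationalismeq.html",
        ".MsoNormal",
    ))
```

Note that this sketch assumes neither the page URL nor the selector contains a double quote; if one does, the YQL statement itself would need escaping first.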

Restrictions:

  • for more complex selections, SelectorGadget returns selectors like this
    :nth-child(3) , .article-meta
    It looks like the YQL table can’t cope with this (probably because :nth-child() comes from a newer CSS standard than the one the YQL table supports)
  • Dapper can do lots more, like capturing multiple fields, grouping, rendering RSS and other output formats,…
  • Dapper has an intuitive GUI

Using html open table

The html open table is a good way around the first restriction. SelectorGadget has a button for converting the CSS selector into an XPath expression. The query then looks like this:
select * from html where url="http://www.doorstroming.net/index.php/actua/49-qnationalisme-is-nationalismeq.html" and xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "MsoNormal", " " ))]'
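
For a single class name, the same XPath can also be put together by hand. The Python sketch below is an illustration under assumptions: the helper names class_xpath and build_html_table_query are made up here, and the final URL-encoding step assumes the statement is sent as the q parameter, as in the REST URL shown earlier.

```python
from urllib.parse import quote


def class_xpath(class_name):
    """Hypothetical helper: XPath matching elements that carry a given class."""
    # Padding @class with spaces makes the test match whole class names
    # only, so "MsoNormal" does not also match "MsoNormalTable".
    return (
        '//*[contains(concat( " ", @class, " " ), '
        f'concat( " ", "{class_name}", " " ))]'
    )


def build_html_table_query(page_url, xpath):
    """Hypothetical helper: YQL statement for the html open table."""
    # The XPath contains double quotes, so it is wrapped in single quotes.
    return f"select * from html where url=\"{page_url}\" and xpath='{xpath}'"


if __name__ == "__main__":
    statement = build_html_table_query(
        "http://www.doorstroming.net/index.php/actua/49-qnationalisme-is-nationalismeq.html",
        class_xpath("MsoNormal"),
    )
    print(statement)
    # URL-encoded form, for use as the q parameter of a REST URL.
    print(quote(statement, safe="*"))
```

For anything more complex than a single class, the XPath produced by SelectorGadget’s conversion button remains the safer choice.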
