It’s fairly simple to mimic (some of) the behaviour of Dapper using these two tools. I’m interested in this because I ran into a Dapper disadvantage (in most cases it’s an advantage, but not in my particular case): it reduces the captured HTML fragments to plain text. That is fine for a simple RSS update, but bad if you want to retain the formatting of the captured text (e.g. italics may be important and relevant for reflecting the content of a text).
- Install the SelectorGadget in your browser (it’s a bookmarklet, so you can simply drag it into your bookmarks)
- Open the webpage you want to retrieve data from (or at least a very similar webpage)
- Start the SelectorGadget
- Play around until you’ve selected the right sections
- Copy the displayed CSS selector
- Open the YQL console
You’ll see a sample query:
select * from data.html.cssselect where url="http://www.doorstroming.net/index.php/actua/49-qnationalisme-is-nationalismeq.html" and css=".MsoNormal"
- Replace the value for ‘css’ with the CSS selector generated by SelectorGadget
- Replace the value for ‘url’ with the URL of the page containing your data. This can be any page with a similar structure to the page you’ve used originally.
- Execute the query or copy the REST URL
The REST URL looks like this:
By replacing the bold sections with the (URL-encoded) URL of the website and the applicable CSS selector (also URL-encoded!) respectively, you can apply the query to any webpage.
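The substitution described above can be scripted. This is a minimal Python sketch, not the actual YQL REST URL: the `{url}` and `{css}` placeholders, the `fill_rest_url` name and the `yql.example` host are all made up for illustration. In practice the template would be the REST URL copied from the YQL console, with its two bold sections swapped for the placeholders.

```python
from urllib.parse import quote


def fill_rest_url(template: str, page_url: str, css: str) -> str:
    """Insert a URL-encoded page URL and CSS selector into a REST URL template.

    `template` is assumed to contain the hypothetical placeholders
    {url} and {css} where the two replaceable sections used to be.
    """
    # safe="" makes quote() encode "/" and ":" as well, so the page URL
    # cannot be confused with the structure of the REST URL itself.
    encoded_url = quote(page_url, safe="")
    encoded_css = quote(css, safe="")
    return template.replace("{url}", encoded_url).replace("{css}", encoded_css)


# Hypothetical template; the host and path are placeholders, not the real endpoint.
TEMPLATE = "https://yql.example/rest?q=url%3D%22{url}%22%20and%20css%3D%22{css}%22"

if __name__ == "__main__":
    print(fill_rest_url(TEMPLATE, "http://www.doorstroming.net/index.php", ".MsoNormal"))
```

Encoding both values with `safe=""` is deliberate: a selector containing spaces, commas or brackets would otherwise break the query string.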
- For more complex selections, SelectorGadget returns selectors like this:
:nth-child(3) , .article-meta
It looks like the YQL table can’t cope with this (probably because :nth-child() comes from a newer CSS standard than the one the YQL table supports).
- Dapper can do lots more, like capturing multiple fields, grouping, rendering RSS and other output formats,…
- Dapper has an intuitive GUI
Using html open table
To work around the first restriction, the html open table is a good solution. SelectorGadget has a button for converting the CSS selector into an XPath expression. The query then looks like this:
select * from html where url="http://www.doorstroming.net/index.php/actua/49-qnationalisme-is-nationalismeq.html" and xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "MsoNormal", " " ))]'
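For a single class selector, the XPath that SelectorGadget produces follows a fixed pattern, so it can also be generated by hand. The Python sketch below assumes that pattern and only covers the simple `.classname` case; the function names `class_to_xpath` and `build_html_query` are made up for illustration.

```python
def class_to_xpath(class_name: str) -> str:
    """Build an XPath expression matching any element that carries the class.

    Padding both sides with spaces makes "MsoNormal" match
    class="a MsoNormal b" but not class="MsoNormalExtra".
    """
    return (
        '//*[contains(concat(" ", @class, " "), '
        f'concat(" ", "{class_name}", " "))]'
    )


def build_html_query(page_url: str, xpath: str) -> str:
    """Build the query for the html open table.

    The XPath is wrapped in single quotes because it contains double quotes.
    """
    return f"select * from html where url=\"{page_url}\" and xpath='{xpath}'"


if __name__ == "__main__":
    xpath = class_to_xpath("MsoNormal")
    print(build_html_query("http://www.doorstroming.net/index.php", xpath))
```

The sketch does no escaping, so a class name or URL that itself contains quotes would need extra handling.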