Yahoo! Developer Network Blog
November 18, 2009
Scraping HTML documents that require POST data with YQL
YQL is a great tool for scraping HTML from the web and turning it into data you can reuse. Scraping does not have to be anything shady: it can be very useful for reusing information that is maintained elsewhere, for example on a blog. My personal portfolio page, http://icant.co.uk, gets most of its data from my blog, which is hosted elsewhere.
The built-in YQL html table lets you scrape any HTML document that the YQL server is allowed to access (some sites use robots.txt to prevent that, which we comply with). For example, the cnn.com homepage:
select * from html where url="http://cnn.com"
The great thing about using this rather than simply loading the data with cURL is that YQL runs the result through HTML Tidy, which turns it into XML-compliant data and removes badly encoded characters that can be a big nuisance. The other great feature is that you can use XPath to filter the data down to what you need. If we want all the links on the cnn.com homepage, we can use this:
select * from html where url="http://cnn.com" and xpath="//a"
One thing that is not widely known is that if you only want the text content of an element while still keeping the element structure, you can select content instead of the * wildcard:
select content from html where url="http://cnn.com" and xpath="//a"
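If you want to run queries like the ones above from your own code rather than from the YQL console, you need to send them to YQL over HTTP. Here is a minimal sketch of how the request URL could be put together. The endpoint address and the format parameter are assumptions on my part (they are not part of the examples above), and the yqlUrl helper is a made-up name for illustration:

```javascript
// Build the REST URL for a YQL query.
// Assumption: the public YQL endpoint and its "q" and "format" parameters.
var YQL_ENDPOINT = 'http://query.yahooapis.com/v1/public/yql';

// Hypothetical helper: returns the URL to request for a given query.
function yqlUrl(query, format) {
  // encodeURIComponent escapes the quotes, spaces, slashes and "=" signs
  // in the query so it survives as a single URL parameter.
  return YQL_ENDPOINT +
    '?q=' + encodeURIComponent(query) +
    '&format=' + encodeURIComponent(format || 'xml');
}

var linksQuery =
  'select content from html where url="http://cnn.com" and xpath="//a"';

console.log(yqlUrl(linksQuery, 'json'));
```

The important part is the escaping: the query contains quotes and a nested URL, so it has to be encoded as a whole before it goes into the q parameter.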
This is all cool and nice, but when an HTML document needs POST data before it returns what you want to scrape, the html table cannot help you, as you can't send POST data in the URL. The workaround is to write an open data table with an execute block that does this job for you.
You can use this new table like this:
select * from htmlpost where
url='http://isithackday.com/hacks/htmlpost/index.php'
and postdata="foo=foo&bar=bar" and xpath="//p"
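The postdata value is a regular form-encoded string, so in practice you will probably want to generate it from an object instead of writing it by hand. The following is a sketch under a few assumptions: the use statement points at a placeholder address (TABLE_URL is hypothetical, replace it with wherever the table definition is hosted), and encodePostData and htmlPostQuery are helper names I made up for this example:

```javascript
// Turn a plain object into a percent-encoded name=value&name=value string.
function encodePostData(fields) {
  return Object.keys(fields).map(function (name) {
    return encodeURIComponent(name) + '=' + encodeURIComponent(fields[name]);
  }).join('&');
}

// Hypothetical placeholder for the location of the open data table.
var TABLE_URL = 'http://example.com/htmlpost.xml';

// Build the full query, including the "use" statement that loads the table.
// Note: values are not escaped for YQL, so they must not contain double quotes.
function htmlPostQuery(url, fields, xpath) {
  return 'use "' + TABLE_URL + '" as htmlpost; ' +
    'select * from htmlpost where url="' + url + '"' +
    ' and postdata="' + encodePostData(fields) + '"' +
    ' and xpath="' + xpath + '"';
}

console.log(htmlPostQuery(
  'http://isithackday.com/hacks/htmlpost/index.php',
  { foo: 'foo', bar: 'bar' },
  '//p'
));
```

For the object { foo: 'foo', bar: 'bar' } this produces the same postdata string as in the query above, foo=foo&bar=bar.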
A detailed write-up about the why and how of this table is available, but here is the most important excerpt of the table source:
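// url, postdata and xpath are the values passed in from the query
var myRequest = y.rest(url);
// Ask for HTML back, send the body as a form submission, then POST it
var data = myRequest.accept('text/html').
           contentType("application/x-www-form-urlencoded").
           post(postdata).response;
// Filter the returned document down to the requested nodes
var xdata = y.xpath(data, xpath);
// E4X literal; the wrapper element name is assumed, as the original markup was lost
response.object = <postresult>{xdata}</postresult>;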
You define a new request and chain all the necessary settings in a single JavaScript statement. As YQL Execute code runs on the server, you have a more powerful API than in the browser: you can state that you want HTML back, declare that the request is sent as a form submission, and simply pass the POST data as a parameter of the post() method. You can then run any XPath expression over the returned data using the xpath() method. Every script in an execute block should return an object with the XML data, and as execute supports E4X you don't need to mess around with generating DOM nodes.
This is just one example of the power of YQL Execute. Please think up more cases that need solutions and have a go yourself. Submit your table to the GitHub repository or tell us about it on the forums.
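If the chaining looks like magic, here is a plain-JavaScript stand-in that shows the idea. To be clear, this is not the real YQL implementation: fakeRest is a made-up name, it sends nothing over the network, and it only records what would have been sent. What it shares with the request object in the table is that every setter returns the object itself, so the calls can be strung together:

```javascript
// Stand-in for the request object, NOT the real YQL implementation.
// Each setter stores a value and returns the same object, which is
// what makes the calls chainable.
function fakeRest(url) {
  var headers = {};
  return {
    accept: function (type) {
      headers['Accept'] = type;
      return this;
    },
    contentType: function (type) {
      headers['Content-Type'] = type;
      return this;
    },
    // A real post() would perform the HTTP request; this one only
    // reports what would have been sent.
    post: function (body) {
      return {
        response: { url: url, method: 'POST', headers: headers, body: body }
      };
    }
  };
}

var sent = fakeRest('http://isithackday.com/hacks/htmlpost/index.php')
  .accept('text/html')
  .contentType('application/x-www-form-urlencoded')
  .post('foo=foo&bar=bar')
  .response;

console.log(sent);
```

The Accept header is what tells the server you want HTML back, and the Content-Type header is what makes the request body count as a form submission.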
Chris Heilmann
@codepo8
Yahoo Developer Network
Posted at November 18, 2009 6:30 PM
Comments
Hi, I've got some error
:18: response.object = {xdata}^;]]>
#1)]]>
Posted by: Leo at January 11, 2010 3:15 PM