Yahoo! Developer Network Blog
October 15, 2009
What Powers SearchMonkey?
Ever wonder how SearchMonkey generates all of this structured data for use in projects like Enhanced Results, Object Facets, Site Facets, BOSS, and ranking? Have you wondered what Yahoo envisions as "a web of concepts", or how SearchMonkey is helping Yahoo! power its next generation of search experiences? Or perhaps you were searching for "rickroll videos" and reached this page by mistake. No matter — here's almost everything you wanted to know about how we get the structured data to power SearchMonkey.
A monkey standing on the shoulders of giants
First is the easy answer: we get the data from our site owners! Yahoo!'s opinion is that nobody understands the content of a website better than the site owner. Through page markup (both RDFa and microformats), structured data feeds, custom data services created with the SearchMonkey developer tool, and XSLT rules written by third-party tools, we'll take whatever data site owners tell us is important about the content on their site. Even if you're not the site owner, you can still submit custom data services to the SearchMonkey Gallery. If approved, your extraction techniques will be applied by Yahoo! Search for others to build applications upon.
What does SearchMonkey feed data look like?
To create a feed, first create an Atom feed of the URLs you want to annotate. Then within each Atom <entry>, add the metadata for the page. For example, assume you have restaurant listings already shown in Yahoo! Search, and you want to show the average review you've collected from your users. One of the entries in your feed would look like:
<y:adjunct version="1.0" name="local" xmlns:y="http://search.yahoo.com/datarss/">
  <y:item rel="dc:subject">
    <y:type typeof="vcard:VCard commerce:Business">
      <y:item rel="vcard:url" resource="http://local.yahoo.com/info-21328305-yahoo-incorporated-sunnyvale"/>
      <y:item rel="review:hasReview">
        <y:type typeof="review:Review">
          <y:meta property="review:rating" datatype="xsd:decimal">4.5</y:meta>
          <y:meta property="review:totalRatings" datatype="xsd:integer">32</y:meta>
        </y:type>
      </y:item>
    </y:type>
  </y:item>
</y:adjunct>
Which results in:
[Image: the enhanced search result for this listing]
It's that easy! Create a couple million of these annotations (hopefully not by hand!), and you are ready to submit the feed to Site Explorer. When Yahoo! processes the feed, your results will be enhanced and appear with 4.5 stars and "32 reviews". Users will see your search result, realize that reviews are available on the page, and click through to read the reviews of this restaurant. That's not just theory: our lab monkeys have shown repeatedly that enhanced results actually increase traffic to sites.
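Since nobody wants to write millions of annotations by hand, here is a minimal Python sketch of generating one from a record. It only mirrors the element layout of the example above; the build_adjunct helper and the example.com URL are made up for illustration, and wrapping the result in an Atom <entry> is left out.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the feed example above.
Y_NS = "http://search.yahoo.com/datarss/"
ET.register_namespace("y", Y_NS)


def y(tag):
    """Return the namespace-qualified name for a y: element."""
    return f"{{{Y_NS}}}{tag}"


def build_adjunct(listing_url, rating, total_ratings):
    """Build one y:adjunct element shaped like the example above.

    Hypothetical helper: the function name and parameters are
    illustrative and not part of any Yahoo! tool.
    """
    adjunct = ET.Element(y("adjunct"), {"version": "1.0", "name": "local"})
    subject = ET.SubElement(adjunct, y("item"), {"rel": "dc:subject"})
    business = ET.SubElement(
        subject, y("type"), {"typeof": "vcard:VCard commerce:Business"}
    )
    ET.SubElement(
        business, y("item"), {"rel": "vcard:url", "resource": listing_url}
    )
    has_review = ET.SubElement(business, y("item"), {"rel": "review:hasReview"})
    review = ET.SubElement(has_review, y("type"), {"typeof": "review:Review"})

    rating_el = ET.SubElement(
        review, y("meta"),
        {"property": "review:rating", "datatype": "xsd:decimal"},
    )
    rating_el.text = str(rating)

    total_el = ET.SubElement(
        review, y("meta"),
        {"property": "review:totalRatings", "datatype": "xsd:integer"},
    )
    total_el.text = str(total_ratings)
    return adjunct


if __name__ == "__main__":
    # Placeholder URL; a real feed would use the listing's own URL.
    element = build_adjunct("http://example.com/restaurants/1", 4.5, 32)
    print(ET.tostring(element, encoding="unicode"))
```

Looping this over a database of listings gives one annotation per page without any hand editing.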
How do I do this with XSLT?
Not everybody likes building a feed. If you're XML-savvy, you can use XSLT to tell Yahoo! how to extract the components of your page that you want to appear in your enhanced results. As a quick example, we'll create a rule for Yahoo! Local. With your favorite XPath plugin for Firefox (mine is XPather), you can quickly identify XPath expressions for Yahoo! to follow and extract data from. For example, the following XPath expression extracts "32 reviews":
//div[@id="yls-dt-tabs"]//li[2]//em
Insert that XPath expression into the Yahoo!-provided boilerplate XSLT, and we'll run it to extract the structured data from your site. If this approach seems pretty fragile because it assumes the HTML structure will not change... you're right. Some sites seldom change their HTML, and other sites are really good about naming the important nodes with unique ids. For those sites, XSLT rules are fairly resilient to changes in the page structure. For other sites that have high variation in their page structure or that are constantly making sweeping changes, XSLT extraction usually doesn't end very well. Fortunately, there's yet another approach.
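To see both the extraction and the fragility, here is a small Python sketch that runs the XPath expression above with the standard library's ElementTree, which supports only a subset of XPath. The two page fragments are invented stand-ins for illustration, not the actual Yahoo! Local markup, and this is not the Yahoo!-provided XSLT.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a listing page (assumption:
# the real page structure is not shown in this post).
PAGE = """<html><body>
<div id="yls-dt-tabs"><ul>
<li><a href="#">Details</a></li>
<li><a href="#">Reviews <em>32 reviews</em></a></li>
</ul></div>
</body></html>"""

# The same fragment after an imagined redesign that adds a new first tab.
REDESIGNED = """<html><body>
<div id="yls-dt-tabs"><ul>
<li><a href="#">Overview</a></li>
<li><a href="#">Details</a></li>
<li><a href="#">Reviews <em>32 reviews</em></a></li>
</ul></div>
</body></html>"""

# ElementTree needs a relative path, hence the leading "." in front of
# the expression from the post.
XPATH = './/div[@id="yls-dt-tabs"]//li[2]//em'


def extract_review_count(markup):
    """Return the text of the first node matched by XPATH, or None."""
    root = ET.fromstring(markup)
    node = root.find(XPATH)
    return node.text if node is not None else None


if __name__ == "__main__":
    print(extract_review_count(PAGE))        # the review count text
    print(extract_review_count(REDESIGNED))  # nothing matches any more
```

The rule depends on the review count living in the second list item. Once a tab is inserted ahead of it, the same expression silently matches nothing, which is exactly the failure mode described above.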
But wait! There's more!
Next is the clever way to do things: it turns out that a lot of the web has already been structured by the semantic web. Yahoo! already has billions of documents that have been annotated with either RDF or microformats. This semantic information provides some solid hints about the structured data on the page, which we happily use for SearchMonkey.
Annotating for the semantic web essentially involves tweaking your page markup to provide additional meaning about the elements on that page. The example below uses RDFa, but you can also provide the same information to Yahoo Search using microformats.
<div typeof="vcard:VCard commerce:Business"
     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
     xmlns:vcard="http://www.w3.org/2006/vcard/ns#"
     xmlns:commerce="http://search.yahoo.com/searchmonkey/commerce/"
     xmlns:review="http://purl.org/stuff/rev#">
  <h1><span property="vcard:fn">Yahoo! Incorporated</span></h1>
  <span>
    <span>User Rating:
      <span property="review:rating">4.5</span> out of <span property="review:maxRating">5</span> stars
      (<span property="review:totalRatings">32</span> reviews).
    </span>
  </span>
</div>
When Yahoo sees the additional RDFa markup in the HTML, we extract the structured data complete with semantics. Your page is no longer a bag of unstructured words — Yahoo! now has information that helps us understand your page better.
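As a rough illustration of what extracting the structured data can look like, the Python sketch below walks the markup above and collects every element that carries a property attribute. It is a deliberately naive reading, not a conforming RDFa parser and not Yahoo!'s extractor; the PREFIXES table is simply copied from the xmlns declarations in the example, and the function names are made up.

```python
import xml.etree.ElementTree as ET

# The RDFa example from this post, unchanged apart from whitespace.
MARKUP = """<div typeof="vcard:VCard commerce:Business"
     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
     xmlns:vcard="http://www.w3.org/2006/vcard/ns#"
     xmlns:commerce="http://search.yahoo.com/searchmonkey/commerce/"
     xmlns:review="http://purl.org/stuff/rev#">
  <h1><span property="vcard:fn">Yahoo! Incorporated</span></h1>
  <span>
    <span>User Rating:
      <span property="review:rating">4.5</span> out of
      <span property="review:maxRating">5</span> stars
      (<span property="review:totalRatings">32</span> reviews).
    </span>
  </span>
</div>"""

# Prefix table copied by hand from the xmlns declarations above
# (ElementTree does not expose them after parsing).
PREFIXES = {
    "vcard": "http://www.w3.org/2006/vcard/ns#",
    "review": "http://purl.org/stuff/rev#",
}


def expand(curie):
    """Turn 'prefix:name' into a full URI when the prefix is known."""
    prefix, _, local = curie.partition(":")
    return PREFIXES.get(prefix, prefix + ":") + local


def extract_properties(markup):
    """Map each property URI to the text content of its element."""
    root = ET.fromstring(markup)
    found = {}
    for element in root.iter():
        curie = element.get("property")
        if curie is not None:
            found[expand(curie)] = "".join(element.itertext()).strip()
    return found


if __name__ == "__main__":
    for uri, value in extract_properties(MARKUP).items():
        print(uri, "=", value)
```

Because the property names travel with the markup, this kind of extraction keeps working when the surrounding layout changes, as long as the annotated spans stay in place.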
But what about the rest of the web?
Finally, the answer from the big brains: magic. A lot of research has gone into "web data mining" both inside Yahoo, and at academic and corporate research institutions worldwide. Deep inside Yahoo! Research, tribes of monkeys are busy creating new technologies for Yahoo to extract objects out of web pages. I wish I could provide a few examples about this magic, but there is no way that I can summarize decades of named entity recognition research in five lines or less. For a good place to get started, see publications on TREC.
Any other methods that Yahoo! uses to extract structured data from web content?
Pizza — lots of pizza.
Are you a site owner interested in how you can build your site to make it easier for us to extract the structured data? We would love your participation.
Interested in being a monkey? We're also hiring!
Kevin Haas
Senior Engineering Manager, Yahoo! SearchMonkey
Posted at October 15, 2009 10:38 AM
Comments
This sounds really interesting and I would LOVE to be able to use this but I am not technical at all and have NO idea what you are talking about!!! Could you send me the "Idiots Guide to Using Search Monkey" please? I would be forever grateful and forgive me for my ignorance. I thank you in advance for your help.
Sharon
sr_clean@sbcglobal.net
Posted by: Sharon at October 16, 2009 9:44 AM