Yahoo! Developer Network Blog
September 22, 2008
SearchMonkey Support for RDFa Enabled
Yahoo! Search is now extracting RDFa data across the World Wide Web and making this information available to the public via SearchMonkey. RDFa is an open standard for embedding structured data directly in HTML. Along with our previous support for eRDF and a number of popular microformats, SearchMonkey now supports a wide variety of popular semantic technologies.
What is structured data, and why is structured data good for search? Traditional search engines crawl the web and extract what metadata they can: the page title, an autogenerated summary, the file size, the MIME-type, the last-updated date, and so on. However, this sort of analysis pales in comparison to what a human being can do simply by glancing at the page. A human can look at the words "Joe's Home Page" and infer, "ah, this page probably belongs to Joe," or look at an image and infer, "ah, that's probably a picture of Joe, the owner of the page." That's easy enough for humans... but what if the search engine could pick out this info and display it directly in the search result?
RDFa embeds structured data in XHTML through attributes. These attributes are not valid in HTML 4, but the W3C has provided an XHTML DTD to validate against. The following example illustrates a simple home page marked up with RDFa data, carried by the property, rel, and resource attributes:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
"http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:foaf="http://xmlns.com/foaf/0.1/"
lang="en" xml:lang="en">
<head>
<title>The Amazing Home Page of Joe Smith</title>
</head>
<body>
<h1 property="dc:title">Joe's Home Page</h1>
<div rel="foaf:maker">
<h2 property="foaf:name">Joe Smith</h2>
<div rel="foaf:depiction" resource="http://joesmith.org/images/jsmith.png">
<img src="/images/jsmith.png" alt="Smiling headshot of Joe" />
<p property="dc:rights">Creative Commons Attribution 3.0 Unported</p>
</div>
</div>
</body>
</html>
In this page, the designer has explicitly stated that the image is a "depiction of the person who made the web page." Adding this information as RDFa can potentially benefit many applications. In the case of Yahoo!, we've designed our search index to extract and store this information.
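To make "extract and store this information" a little more concrete, here is a rough Python sketch of how the statements in the example page could be read out as subject/predicate/object triples. This is not how the Yahoo! index works internally and it is not a conforming RDFa processor: it only looks at the property, rel, resource, and content attributes, leaves prefixes such as dc: and foaf: unexpanded, and always invents a blank node for a rel that has no resource. The function name extract_triples and the page URL http://joesmith.org/ are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from itertools import count

# Body of the example page above (XML declaration and DOCTYPE omitted).
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <body>
    <h1 property="dc:title">Joe's Home Page</h1>
    <div rel="foaf:maker">
      <h2 property="foaf:name">Joe Smith</h2>
      <div rel="foaf:depiction" resource="http://joesmith.org/images/jsmith.png">
        <img src="/images/jsmith.png" alt="Smiling headshot of Joe" />
        <p property="dc:rights">Creative Commons Attribution 3.0 Unported</p>
      </div>
    </div>
  </body>
</html>"""


def extract_triples(xhtml, page_url):
    """Simplified, hypothetical walk over RDFa attributes (not a full parser)."""
    root = ET.fromstring(xhtml)
    blank_ids = count(1)
    triples = []

    def walk(element, subject):
        child_subject = subject
        rel = element.get("rel")
        if rel is not None:
            # rel links the current subject to a resource; with no explicit
            # resource attribute, a fresh blank node stands in for it.
            target = element.get("resource") or f"_:b{next(blank_ids)}"
            triples.append((subject, rel, target))
            child_subject = target
        prop = element.get("property")
        if prop is not None:
            # property attaches a literal: the content attribute if present,
            # otherwise the element's text.
            literal = element.get("content") or "".join(element.itertext()).strip()
            triples.append((subject, prop, literal))
        for child in element:
            walk(child, child_subject)

    walk(root, page_url)
    return triples


for triple in extract_triples(PAGE, "http://joesmith.org/"):
    print(triple)
```

Run against the example, the sketch yields five triples: the page has a dc:title, the page has a foaf:maker (a blank node), that maker has a foaf:name of "Joe Smith" and a foaf:depiction pointing at the image, and the image carries the dc:rights statement.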
RDFa support has already enabled some interesting new SearchMonkey applications. For instance, Creative Commons has recently started to deploy RDFa across the web in the form of copyright and licensing information. Every time a Creative Commons user selects a CC license, the generated HTML badge contains RDFa markup indicating the nature of the license. The Creative Commons Infobar uses this data to selectively trigger on pages that declare their license using structured markup.
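As a rough illustration of what "declaring a license using structured markup" looks like to a program, the Python sketch below scans a page for links marked rel="license", which is the attribute the Creative Commons badge uses. It is only a sketch: the class and function names are made up for this example, the badge snippet is a trimmed stand-in rather than the exact generated HTML, and the actual Infobar is built with the SearchMonkey developer tool, not with this code.

```python
from html.parser import HTMLParser


class LicenseLinkFinder(HTMLParser):
    """Collects href values from <a> or <link> elements with rel="license"."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # rel may hold several space-separated tokens, e.g. rel="license nofollow".
        rel_tokens = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if tag in ("a", "link") and "license" in rel_tokens and href:
            self.licenses.append(href)


def find_license_links(html):
    """Hypothetical helper: return every license URL declared in the markup."""
    finder = LicenseLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.licenses


# Trimmed stand-in for a Creative Commons badge (image path is a placeholder).
BADGE = """<a rel="license" href="http://creativecommons.org/licenses/by/3.0/">
<img alt="Creative Commons License" src="badge.png" /></a>"""

print(find_license_links(BADGE))
```

A page without such a link returns an empty list, which is the "do not trigger" case.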

To get started with RDFa:
- Learn the basics with the W3C RDFa Primer
- Dive into the details with the full RDFa Specification
- Join the community at the RDFa homepage
- Test your structured markup skills with the RDFa Distiller
- Filter on RDFa in Yahoo! searches with the searchmonkeyid:com.yahoo.rdf.rdfa parameter
- Start displaying RDFa to millions of users with the SearchMonkey developer tool
Evan Goer
Yahoo! SearchMonkey Team
Posted at September 22, 2008 2:00 PM
Comments
I thought you had been extracting RDFa for a while now... Wasn't that the case?
Posted by: David Peterson at September 24, 2008 7:06 AM
Nope, we launched back in May with eRDF and a few popular microformats such as hCard, hCalendar, and XFN. Once RDFa reached W3C Candidate Recommendation on June 20, we were able to move forward with it.
Posted by: Evan at September 24, 2008 11:51 AM
Is it possible to query on specific values set with RDFa markup? For example, I just added the following to http://www.snee.com/bobdc.blog/2008/02/the_future_of_rdfa.html:
<span xmlns:sn="http://www.snee.com/ns/whatever/"
about="http://www.flickr.com/photos/bobdc/1881958623"
property="sn:goofinessFactor" content="3.4"/>
Once the Yahoo crawlers know that this page has been updated, is there a query that could access the page or any of the information in the triple based on the information in the triple?
thanks,
Bob
Posted by: Bob DuCharme at September 27, 2008 12:55 PM
Hi Bob -- Right now, the only way to inspect and play with the actual contents of the structured markup is the SearchMonkey devtool. The front end of the search engine just supports simple filtering. You can say, "Give me all pages that A) have hresume data and B) happen to have the string 'PHP'", but not "Give me all pages with 'PHP' marked up as part of the hresume."
Posted by: Evan at September 29, 2008 9:27 AM
It sounds like the front end currently doesn't support anything specific to RDFa. Is this correct?
thanks,
Bob
Posted by: Bob DuCharme at October 1, 2008 12:28 PM