Yahoo! Developer Network Blog



September 22, 2008

SearchMonkey Support for RDFa Enabled

Yahoo! Search is now extracting RDFa data across the World Wide Web and making this information available to the public via SearchMonkey. RDFa is an open standard for embedding structured data directly in HTML. Combined with our existing support for eRDF and a number of popular microformats, this means SearchMonkey now handles a wide variety of semantic technologies.

What is structured data, and why is structured data good for search? Traditional search engines crawl the web and extract what metadata they can: the page title, an autogenerated summary, the file size, the MIME-type, the last-updated date, and so on. However, this sort of analysis pales in comparison to what a human being can do simply by glancing at the page. A human can look at the words "Joe's Home Page" and infer, "ah, this page probably belongs to Joe," or look at an image and infer, "ah, that's probably a picture of Joe, the owner of the page." That's easy enough for humans... but what if the search engine could pick out this info and display it directly in the search result?

RDFa uses attributes to embed structured data in XHTML. These attributes are not valid in HTML 4, but the W3C has provided an XHTML DTD to validate against. The following example illustrates a simple home page marked up with RDFa data (carried by the property, rel, and resource attributes):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
          "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:foaf="http://xmlns.com/foaf/0.1/"
      lang="en" xml:lang="en">
<head>
  <title>The Amazing Home Page of Joe Smith</title>
</head>
<body>
  <h1 property="dc:title">Joe's Home Page</h1>
  <div rel="foaf:maker">
    <h2 property="foaf:name">Joe Smith</h2>
    <div rel="foaf:depiction" resource="http://joesmith.org/images/jsmith.png">
      <img src="/images/jsmith.png" alt="Smiling headshot of Joe" />
      <p property="dc:rights">Creative Commons Attribution 3.0 Unported</p>
    </div>
  </div>
</body>
</html>

In this page, the designer has explicitly stated that the image is a "depiction of the person who made the web page." Adding this information as RDFa can potentially benefit many applications. In the case of Yahoo!, we've designed our search index to extract and store this information.
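To get a feel for what a consumer of this markup has to work with, here is a minimal sketch in Python that lists the RDFa attributes in a trimmed copy of the page above and expands prefixed names such as foaf:maker using the xmlns declarations. The names RDFaAttributeLister and list_rdfa are made up for this illustration. This is not a conforming RDFa processor (it does not resolve subjects or build triples), and it is not how Yahoo! Search extracts the data.

```python
from html.parser import HTMLParser

# Trimmed copy of the example page above.
SAMPLE = """
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:foaf="http://xmlns.com/foaf/0.1/">
<body>
  <h1 property="dc:title">Joe's Home Page</h1>
  <div rel="foaf:maker">
    <h2 property="foaf:name">Joe Smith</h2>
    <div rel="foaf:depiction" resource="http://joesmith.org/images/jsmith.png">
      <img src="/images/jsmith.png" alt="Smiling headshot of Joe" />
    </div>
  </div>
</body>
</html>
"""


class RDFaAttributeLister(HTMLParser):
    """Hypothetical helper: records RDFa attributes, nothing more."""

    def __init__(self):
        super().__init__()
        self.prefixes = {}  # e.g. "foaf" -> "http://xmlns.com/foaf/0.1/"
        self.found = []     # (tag, attribute, value) in document order

    def expand(self, name):
        # Turn "foaf:maker" into a full URI when the prefix is declared.
        prefix, sep, local = name.partition(":")
        if sep and prefix in self.prefixes:
            return self.prefixes[prefix] + local
        return name

    def handle_starttag(self, tag, attrs):
        # First pick up any namespace prefixes declared on this element.
        for name, value in attrs:
            if name.startswith("xmlns:") and value:
                self.prefixes[name[len("xmlns:"):]] = value
        # Then record the RDFa attributes themselves.
        for name, value in attrs:
            if name in ("rel", "property") and value:
                self.found.append((tag, name, self.expand(value)))
            elif name in ("about", "resource", "content") and value:
                self.found.append((tag, name, value))


def list_rdfa(markup):
    lister = RDFaAttributeLister()
    lister.feed(markup)
    return lister.found


if __name__ == "__main__":
    for tag, attribute, value in list_rdfa(SAMPLE):
        print(tag, attribute, value)
```

Running the script prints one line per RDFa attribute, with dc: and foaf: names expanded to full URIs. Working out which subject each property belongs to (for example, that foaf:name describes the maker rather than the page) is the job of a real RDFa parser.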

RDFa support has already enabled some interesting new SearchMonkey applications. For instance, Creative Commons has recently started to deploy RDFa across the web in the form of copyright and licensing information. Every time a Creative Commons user selects a CC license, the generated HTML badge contains RDFa markup indicating the nature of the license. The Creative Commons Infobar uses this data to selectively trigger on pages that declare their license using structured markup:

[Image: SearchMonkey Infobar showing a Creative Commons license]

To get started with RDFa:

Evan Goer
Yahoo! SearchMonkey Team

Posted at September 22, 2008 2:00 PM | Permalink


Comments

I thought you had been extracting RDFa for a while now... Wasn't that the case?

Posted by: David Peterson at September 24, 2008 7:06 AM

Nope, we launched back in May with eRDF and a few popular microformats such as hCard, hCalendar, and XFN. Once RDFa reached Candidate Recommendation on June 20, we were able to move forward with it.

Posted by: Evan at September 24, 2008 11:51 AM

Is it possible to query on specific values set with RDFa markup? For example, I just added the following to http://www.snee.com/bobdc.blog/2008/02/the_future_of_rdfa.html:

<span xmlns:sn="http://www.snee.com/ns/whatever/"
about="http://www.flickr.com/photos/bobdc/1881958623"
property="sn:goofinessFactor" content="3.4"/>

Once the Yahoo crawlers know that this page has been updated, is there a query that could access the page or any of the information in the triple based on the information in the triple?

thanks,

Bob

Posted by: Bob DuCharme at September 27, 2008 12:55 PM

Hi Bob -- Right now, the only way to inspect and play with the actual contents of the structured markup is the SearchMonkey devtool. The front end of the search engine just supports simple filtering. You can say, "Give me all pages that A) have hResume data and B) happen to have the string 'PHP'", but not "Give me all pages with 'PHP' marked up as part of the hResume."

Posted by: Evan at September 29, 2008 9:27 AM

It sounds like the front end currently doesn't support anything specific to RDFa. Is this correct?

thanks,

Bob

Posted by: Bob DuCharme at October 1, 2008 12:28 PM

Copyright © 2010 Yahoo! Inc. All rights reserved.