Showing posts with label data engineering bulletin. Show all posts

Monday, January 29, 2007

Structured data and the web

One of my main areas of focus at Google is the relationship between structured data and web search. There are now vast amounts of structured data out there: mostly in the deep web (i.e., in databases behind HTML forms), but also in content created through annotation schemes (e.g., Flickr, Google Co-op) and in Google Base. The question then is how to use this data to improve the results of web search.

I recently published two related papers on this topic, one at CIDR 2007 and one in the Data Engineering Bulletin (go to Page 19 of the issue). You can read the papers for the details, but I'd like to highlight two key points from these papers that should be kept in mind when researching this area:

  1. Integration: Whenever you cook up an idea about how to improve web search by leveraging structured data, or by automatically structuring data on the web, you need to keep in mind how your technique will integrate with the rest of web search. Users want to go to a single search box to find all their results. So whatever technique you come up with needs to mesh well with the other techniques used by the underlying engine.
  2. Data about everything: Many ideas work well if the domain of the data is constrained (e.g., you know you're building a portal to search for cars, housing, or job listings). But on the web, data is about everything. There is no domain or set of domains that covers all data on the web. In fact, it's not even clear where one domain ends and another begins. So try to imagine what it's like to deal with data about everything. That changes a lot about the way you think about a problem!