Monday, January 29, 2007

Structured data and the web

One of my main areas of focus at Google is the relationship between structured data and web search. There are now vast amounts of structured data out there: mostly in the deep web (i.e., in databases behind HTML forms), but also in data created by annotation schemes (e.g., Flickr, Google Co-op, etc.) and in Google Base. The question is how to use this data to improve the results of web search.

I recently published two related papers on this topic, one at CIDR 2007 and one in the Data Engineering Bulletin (go to Page 19 of the issue). You can read the papers for the details, but I'd like to highlight two key points from these papers that should be kept in mind when researching this area:

  1. Integration: Whenever you cook up an idea about how to improve web search by leveraging structured data, or by automatically structuring data on the web, you need to keep in mind how your technique will integrate with other web searches. Users want to go to a single search box to find all their results. So whatever technique you come up with needs to mesh well with the other techniques used by the underlying engine.
  2. Data about everything: Many ideas work well if the domain of the data is constrained (e.g., you know you're building a portal to search for cars, housing or job listings). But on the web, data is about everything. There is no domain or set of domains that covers all data on the web. In fact, it's not even clear when one domain ends and another one begins. So try to imagine what it's like to deal with data about everything. That changes a lot in the way you think about a problem!


Logan Henriquez said...

I've been developing a vertical search site that structures data for a small domain. Naturally this led to theorizing about how to generalize the functionality, and I'm interested in your view on some of the ideas I came up with.

* A user-driven taxonomy for each domain. As your paper pointed out, this is hard, but sites like Wikipedia already list hundreds of thousands of domains, define them, and usually specify taxonomies for them (the contents list). You could imagine crawling Wikipedia monthly and using its domain definitions and taxonomies to organize crawled results. This makes the taxonomy and domain definitions dynamic and user-driven.

* A domain-specific result definition. I'm assuming here that true structured search is about showing the user objects relevant to their domain, not just URLs. For example, vertical search sites serve objects: houses for sale, artwork for sale, or jobs. The title field of Wikipedia's listings usually contains a noun that is the object most relevant to the domain. Object attributes might also be extracted from Wikipedia with some linguistic analysis, although this would be harder.

* A domain-specific URL list: Google Co-op and pages with high PageRank for domain-specific keywords could supply a set of URLs to restrict results to, increasing relevance when a user searches within a specific domain.

* A domain-specific relevancy ranking. For art I've developed my own: a weighted average of all the attributes that art buyers seem to be interested in. Such a ranking could be developed by domain experts or learned from user behavior, and for many domains it could be as simple as a straight average of the number of matched attributes. Google's current PageRank mechanism might also be an element of the ranking.

* A domain-specific page parser to extract objects and attributes from URLs for a specific domain. This seems like the hardest element to do well. For some domains, like product search across e-commerce sites, the relevant objects (products) are listed on the page with metadata that helps define the attributes and identify their values. The Wikipedia taxonomy would also offer a set of words that could be used to identify attributes. But for many domains there is no metadata, and the relevant attribute => value pairs will be hard to extract. For example, when artsugar crawls art gallery sites there is no metadata, so I have to use an array of techniques unique to how art gallery sites are organized to extract attributes like medium, size, artist, and color. There's no quick solution here, but user-driven submissions to facilities like Google Base, or even microformats, offer hope. Today, website creators expend enormous effort trying to raise the ranking of their results in Google with SEO techniques. What if instead they knew that using the appropriate microformat would give them a good ranking in a major search engine's domain-specific results? Wouldn't they do it? (Microformats are much easier than SEO.) They'd probably also try to spam the results, but it's much easier to control spam when you're serving up results for a specific domain and you know what the objects are and their attributes.
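To make the weighted-average ranking idea above concrete, here's a minimal sketch; the attribute names, weights, and exact-match rule are all illustrative, not what artsugar actually uses:

```python
# Sketch of a domain-specific relevancy ranking: score an item by the
# weighted fraction of the query's attributes that it matches.
# Attribute names and weights below are hypothetical examples.

WEIGHTS = {"artist": 3.0, "medium": 2.0, "size": 1.0, "color": 1.0}

def relevance(query_attrs, item_attrs, weights=WEIGHTS):
    """Weighted average over the query's attributes: weight counts toward
    the score only when the item's value matches the query's value."""
    total = sum(weights.get(a, 1.0) for a in query_attrs)
    if total == 0:
        return 0.0
    matched = sum(weights.get(a, 1.0)
                  for a, v in query_attrs.items()
                  if item_attrs.get(a) == v)
    return matched / total

# Example: query for oil paintings by a given artist.
query = {"artist": "Doe", "medium": "oil"}
item = {"artist": "Doe", "medium": "watercolor", "size": "24x36"}
print(relevance(query, item))  # only "artist" matches: 3.0 / 5.0 = 0.6
```

A real version would match values fuzzily (size ranges, color synonyms) rather than exactly, and could blend this score with a global signal like PageRank, as suggested above.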


logan henriquez

Anonymous said...

Thanks for the nice post!

rastaquere said...

Hello Alon,
Interesting post.
If users want everything from a single search query box, and you want new techniques to mesh well with the underlying engine's techniques, then natural language seems the only answer. The ultimate underlying order of the web is language, integrated by hypertext and database structures. You might like this:
It's an interactive suggestion tool for searching anything by any variable in a vast data set; in this case, a large auto sales catalog. Forgive the German, please, but this already works in any language.

It's not natural language per se, but real-time natural language is still far off and only applies to one language at a time. With this, you can search multilingual data too. What do you think?

exorbyte ATT gmail DOAT com

Anonymous said...

nice post