I recently published two related papers on this topic, one at CIDR 2007 and one in the Data Engineering Bulletin (go to Page 19 of the issue). You can read the papers for the details, but I'd like to highlight two key points from these papers that should be kept in mind when researching this area:
- Integration: Whenever you cook up an idea about how to improve web search by leveraging structured data, or by automatically structuring data on the web, you need to keep in mind how your technique will integrate with other web searches. Users want to go to a single search box to find all their results. So whatever technique you come up with needs to mesh well with the other techniques used by the underlying engine.
- Data about everything: Many ideas work well if the domain of the data is constrained (e.g., you know you're building a portal to search for cars, housing or job listings). But on the web, data is about everything. There is no domain or set of domains that covers all data on the web. In fact, it's not even clear when one domain ends and another one begins. So try to imagine what it's like to deal with data about everything. That changes a lot in the way you think about a problem!