Monday, January 29, 2007

Uncertainty and data integration

A short rant about the relationship between databases that manage uncertainty and lineage and data integration systems.

Recently, there has been renewed interest in building database systems that handle uncertain data and its lineage in a principled way. The Trio Project at Stanford and the MystiQ Project at the University of Washington are just two examples (I collaborated for a while on the former, and watched the latter up close while I was a professor at UW).

I think this is a great research area and certainly (no pun intended) a very timely one. I want to make two points though (one of which may raise Jennifer Widom's blood pressure).

First, I think data integration is the killer application of uncertainty and lineage (ok, maybe there is a second -- sensor networks). Fundamentally, data is uncertain when it comes from external sources and some of the transformations it went through on the way are not necessarily correct.

In fact, I think one of the greatest challenges for data integration research is to build data integration systems that deal gracefully with uncertainty (uncertainty can be about the underlying data, the schema mappings and the mapping of keyword queries to structured queries). If you have good ideas about this, please do contact me.

My second point is that there is really no argument here. In fact, I believe that once a database system is able to model and process uncertain data and its lineage, much of the distinction between traditional database systems and data integration systems goes away.

Specifically, by modeling the lineage of data and the fact that it may be uncertain, you're admitting that the data came from somewhere, and that you're not sure about the transformations that brought it into the database or about its intrinsic meaning. That's exactly what data integration is about -- modeling data that comes from multiple sources. Unlike ordinary databases, where the data might as well have been born in the database because you know nothing about its past, databases with uncertainty and lineage admit that data had a prior life.
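To make the idea of data with "a prior life" a bit more concrete, here is a minimal sketch in Python. It is only an illustration: the names (Fact, from_source, transform) are hypothetical, and this is not how Trio or MystiQ represent uncertainty. In particular, multiplying confidences assumes that each transformation's errors are independent of the errors in its input, which is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Fact:
    """A value together with how much we trust it and where it came from.

    Hypothetical structure for illustration only.
    """
    value: str
    confidence: float          # degree of belief, in [0, 1]
    lineage: Tuple[str, ...]   # sources and transformations, oldest first


def from_source(value: str, source: str, confidence: float) -> Fact:
    """Record a value as it arrives from an external source."""
    return Fact(value, confidence, (f"source:{source}",))


def transform(fact: Fact, name: str,
              fn: Callable[[str], str], reliability: float) -> Fact:
    """Apply a possibly incorrect transformation and extend the lineage.

    Assumption: errors are independent, so confidences multiply.
    """
    return Fact(
        fn(fact.value),
        fact.confidence * reliability,
        fact.lineage + (f"transform:{name}",),
    )


# A value crawled from an outside source, then cleaned by an imperfect step.
raw = from_source("  seattle ", "crawled_listing", 0.5)
clean = transform(raw, "normalize_city", lambda s: s.strip().title(), 0.5)
```

The point of the sketch is that the cleaned value never forgets its past: clean.lineage still names the original source and every step applied since, and clean.confidence reflects doubt about both.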

So then what's left of the difference between databases and data integration systems? Mostly issues having to do with query processing over remote sources.

I should emphasize -- I'm not claiming that these problems are solved (quite the contrary, see my comment above about data integration with uncertainty). But I do find it quite appealing that a database system models the fact that data came from the outside. That's the way it typically is in the real world, and it's about time databases realize it too.

Structured data and the web

One of my main areas of focus at Google is the relationship between structured data and web search. There are now vast amounts of structured data out there, mostly in the deep web (i.e., in databases behind HTML forms), created by annotation schemes (e.g., Flickr, Google Co-op, etc.), and in Google Base. The question then is how to use this data to improve the results of web search.

I recently published two related papers on this topic, one at CIDR 2007 and one in the Data Engineering Bulletin (go to Page 19 of the issue). You can read the papers for the details, but I'd like to highlight two key points from these papers that should be kept in mind when researching this area:

  1. Integration: Whenever you cook up an idea about how to improve web search by leveraging structured data, or by automatically structuring data on the web, you need to keep in mind how your technique will integrate with other web searches. Users want to go to a single search box to find all their results. So whatever technique you come up with needs to mesh well with the other techniques used by the underlying engine.
  2. Data about everything: Many ideas work well if the domain of the data is constrained (e.g., you know you're building a portal to search for cars, housing or job listings). But on the web, data is about everything. There is no domain or set of domains that covers all data on the web. In fact, it's not even clear when one domain ends and another one begins. So try to imagine what it's like to deal with data about everything. That changes a lot in the way you think about a problem!

Ein Gedi

I recently went for a quick trip to visit my family in Israel (with a short stop in Amsterdam, where I saw an amazing exhibit).

While in Israel, I went for a hike in Ein Gedi, one of my all-time favorite places. The trip was organized by Tova Milo's database group at Tel-Aviv University. Check out the pictures from Ein Gedi (and a few others).

While Ein Gedi has always been a place for me to find complete peace, staring into the Dead Sea, this time my BlackBerry made it a slightly different experience.

Sunday, January 28, 2007

Introduction

I've finally decided to start blogging and am very excited about it!

The posts on this blog will either be about work (i.e., data management ideas) or my family. No politics (you probably don't want to hear it anyway), but possibly a passing comment on coffee or other exciting events.

By way of background, until recently I was a professor at the University of Washington. I moved to Google in September 2005 and lead a group that looks at how structured data can be used in Web search. As I publish papers about this work, I'll summarize them here. For all my publications prior to coming to Google you can check out my UW web site.

One of the goals of this blog is to get people in the data management community to share novel ideas and discuss them. While technical results are well served by published papers, Web 2.0 gives us the opportunity to discuss ideas outside our conferences quite easily.