Monday, January 29, 2007

Uncertainty and data integration

A short rant about the relationship between databases that manage uncertainty and lineage and data integration systems.

Recently, there has been renewed interest in building database systems that handle uncertain data and its lineage in a principled way. The Trio Project at Stanford and the MystiQ Project at the University of Washington are just two examples (I collaborated for a while on the former, and watched the latter up close while I was a professor at UW).

I think this is a great research area and certainly (no pun intended) a very timely one. I want to make two points though (one of which may raise Jennifer Widom's blood pressure).

First, I think data integration is the killer application of uncertainty and lineage (ok, maybe there is a second -- sensor networks). Fundamentally, data is uncertain when it comes from external sources and some of the transformations it went through on the way are not necessarily correct.

In fact, I think one of the greatest challenges for data integration research is to build data integration systems that deal gracefully with uncertainty (uncertainty can be about the underlying data, the schema mappings and the mapping of keyword queries to structured queries). If you have good ideas about this, please do contact me.

My second point is that there is really no argument here. In fact, I believe that once a database system is able to model and process uncertain data and its lineage, much of the distinction between traditional database systems and data integration systems goes away.

Specifically, by modeling data lineage and acknowledging that the data may be uncertain, you're admitting that the data came from somewhere, and that you're not sure about the transformations that brought it into the database or about its intrinsic meaning. That's exactly what data integration is about -- modeling data that comes from multiple sources. Unlike ordinary databases, where the data might as well have been born in the database because you know nothing about its past, databases with uncertainty and lineage admit that data had a prior life.
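To make the idea of data with a "prior life" a bit more concrete, here is a minimal sketch in Python. It is purely illustrative and is not how Trio or MystiQ represent anything: the names `Step` and `UncertainValue`, the example source, and the confidence numbers are all made up, and the rule of multiplying confidences assumes the steps fail independently, which is a strong simplification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Step:
    """One transformation the value went through (hypothetical model)."""
    description: str
    confidence: float  # assumed probability that this step was correct


@dataclass
class UncertainValue:
    """A value that remembers where it came from and how sure we are."""
    value: str
    source: str               # the external source the value came from
    source_confidence: float  # assumed probability the source was right
    lineage: List[Step] = field(default_factory=list)

    def confidence(self) -> float:
        # Assumption: the source and every step are independent, so the
        # overall confidence is simply the product of the individual ones.
        result = self.source_confidence
        for step in self.lineage:
            result *= step.confidence
        return result


# A made-up value that was extracted from a web page and then mapped
# into a local schema.
city = UncertainValue(
    value="Seattle",
    source="example-web-crawl",
    source_confidence=0.9,
    lineage=[
        Step("extracted from HTML table", 0.8),
        Step("mapped 'town' column to 'city'", 0.5),
    ],
)

print(city.value, "from", city.source, "confidence", round(city.confidence(), 2))
```

An ordinary database would store only `"Seattle"`. Here the value carries its source and the steps that brought it in, which is the kind of information a data integration system has to reason about anyway.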

So then what's left of the difference between databases and data integration systems? Mostly issues having to do with query processing over remote sources.

I should emphasize -- I'm not claiming that these problems are solved (quite the contrary, see my comment above about data integration with uncertainty). But I do find it quite appealing that a database system models the fact that data came from the outside. That's the way it typically is in the real world, and it's about time databases realize it too.

6 comments:

said...

Hi Alon,

I just finished reading the paper "From Databases to Dataspaces: A New Abstraction for Information Management" and I would say it really enlightened me.

I'm curious about your opinion on the differences between "Information Management or data-spaces" and "Data Integration". IMHO, they look like the same thing.

So, if a DBMS can model uncertainty in data sources ('participants' in data-spaces), does that mean DBMS approaches will be the only dominant ones in data-spaces, and does the same go for ACID properties?

Regards,

Jin He

Naveen said...

I think another 'killer application' for uncertain data management is information extraction. It's high time extraction systems incorporate and provide measures of (un)certainty along with the extracted data they return.

auto said...

i fully agree

David said...

i agree with naveen too

said...

I am a newcomer to "Data integration". I have already read some papers such as "Data Integration with Uncertainty" and "Databases with Uncertainty and Lineage". If I want to get a general idea about uncertainty in data integration, which other papers should I study?
