A short rant about the relationship between databases that manage uncertainty and lineage and data integration systems.
Recently, there has been renewed interest in building database systems that handle uncertain data and its lineage in a principled way. The Trio Project at Stanford and the MystiQ Project at the University of Washington are just two examples (I collaborated for a while on the former, and watched the latter up close while I was a professor at UW).
I think this is a great research area and certainly (no pun intended) a very timely one. I want to make two points though (one of which may raise Jennifer Widom's blood pressure).
First, I think data integration is the killer application for uncertainty and lineage (ok, maybe there is a second -- sensor networks). Fundamentally, data is uncertain when it comes from external sources and some of the transformations it went through on the way are not necessarily correct.
In fact, I think one of the greatest challenges for data integration research is to build data integration systems that deal gracefully with uncertainty (uncertainty can be about the underlying data, the schema mappings and the mapping of keyword queries to structured queries). If you have good ideas about this, please do contact me.
My second point is that there is really no argument here. In fact, I believe that once a database system is able to model and process uncertain data and its lineage, much of the distinction between traditional database systems and data integration systems goes away.
Specifically, by modeling lineage and the possibility that data is uncertain, you're admitting that the data came from somewhere, and that you're not sure about the transformations that brought it into the database or about its intrinsic meaning. That's exactly what data integration is about -- modeling data that comes from multiple sources. Unlike ordinary databases, where the data might as well have been born in the database because you know nothing about its past, databases with uncertainty and lineage admit that data had a prior life.
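To make the "prior life" idea concrete, here is a minimal sketch of what a tuple might look like once it carries a confidence and a lineage record. Nothing here is taken from Trio or MystiQ: the class names, the `integrate` helper, the source and mapping names, and the rule of multiplying the two trust values (which assumes source errors and mapping errors are independent) are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lineage:
    source: str          # hypothetical name of the external source
    transformation: str  # hypothetical name of the mapping that was applied


@dataclass(frozen=True)
class UncertainTuple:
    values: tuple
    confidence: float    # belief that the tuple is correct, in [0, 1]
    lineage: tuple = ()  # empty lineage: the tuple was "born in the database"


def integrate(record, source, transformation, source_trust, mapping_trust):
    """Import a record, remembering where it came from and how sure we are.

    Assumption for illustration only: source errors and mapping errors are
    independent, so the two trust values are simply multiplied.
    """
    return UncertainTuple(
        values=record,
        confidence=source_trust * mapping_trust,
        lineage=(Lineage(source, transformation),),
    )


# A tuple entered locally: fully trusted, with no recorded past.
local = UncertainTuple(("Alice", "Seattle"), 1.0)

# A tuple imported from an external source through an imperfect mapping.
imported = integrate(("Bob", "Palo Alto"), "crawl_feed", "address_to_city", 0.9, 0.8)
```

In this sketch the local tuple and the imported one live in the same structure; the only difference is that the imported one admits where it came from and how much we believe it, which is the sense in which the line between a database and a data integration system starts to blur.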
So then what's left of the difference between databases and data integration systems? Mostly issues having to do with query processing over remote sources.
I should emphasize -- I'm not claiming that these problems are solved (quite the contrary, see my comment above about data integration with uncertainty). But I do find it quite appealing that a database system models the fact that data came from the outside. That's the way it typically is in the real world, and it's about time databases realized it too.