Thursday, July 31, 2008

OpenII: Open Source Information Integration Suite

Last week I hosted the OpenII kickoff workshop at Google. We had representatives from several companies: IBM, Microsoft, Yahoo, MITRE, Google, one guy who was supposed to represent Oracle but decided to be a professor again, and a couple of professors.

The goal of OpenII, as the name implies, is to create an open-source set of tools for information integration. The tool set will include, among others, wrappers for common data sources, tools for creating matches and mappings between disparate schemas, a tool for searching a collection of schemas, and run-time tools for processing queries over heterogeneous data sets.

The main goal of the effort is to foster innovation in the field of information integration and create tools that are usable for a wide range of applications.

In research, we often innovate on a specific aspect of information integration, but then spend much of our time building (and rebuilding) other components that we need in order to validate our contributions. Having a set of open-source tools will enable us to focus on our innovations and perform more meaningful comparisons between our methods.

On the applications side, information integration comes in many flavors, and therefore it is hard for commercial products to serve all the needs. Our goal is to create tools that can be applied in a variety of architectural contexts (e.g., materializing all the data in one repository vs. leaving the data in the sources and accessing it only at query time). In addition, many of the tools (e.g., schema matchers or dedup engines) often need to be extended for the particular domain at hand to fully leverage domain knowledge. Open-source tools allow application developers to do exactly that.
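To make the extensibility point concrete, here is a minimal sketch of what "extending a matcher with domain knowledge" could look like. It is not OpenII code: the function names (`name_similarity`, `make_domain_matcher`, `match_schemas`), the synonym table, and the 0.6 threshold are all hypothetical, chosen only for illustration. A generic matcher compares attribute names as strings, and a domain-specific wrapper adds a synonym table on top of it.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Generic string similarity between two attribute names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def make_domain_matcher(synonyms):
    """Wrap the generic matcher with a (hypothetical) domain synonym table.

    `synonyms` is a list of groups; names in the same group are treated
    as referring to the same concept.
    """
    canonical = {}
    for group in synonyms:
        for term in group:
            canonical[term.lower()] = group[0].lower()

    def similarity(a, b):
        ca = canonical.get(a.lower())
        cb = canonical.get(b.lower())
        # Known synonyms are a perfect match; otherwise fall back to
        # the generic string comparison.
        if ca is not None and ca == cb:
            return 1.0
        return name_similarity(a, b)

    return similarity


def match_schemas(source, target, similarity, threshold=0.6):
    """Map each source attribute to its best target attribute, if good enough."""
    matches = {}
    for s in source:
        best = max(target, key=lambda t: similarity(s, t))
        if similarity(s, best) >= threshold:
            matches[s] = best
    return matches
```

With the generic matcher alone, `zip` finds no counterpart among `postal_code` and `last_name`. Once the developer supplies a synonym group such as `["postal_code", "zip", "zipcode"]`, `match_schemas` maps `zip` to `postal_code`. A closed tool would have no place to plug that table in.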

You'll be hearing more about this project as we make progress. If you would like to contribute to it, please contact me!


Jeff Hammerbacher said...

Hey Alon,

Where can I get the source for OpenII? The product at seems to have similar aims but doesn't look like it has your involvement.

Quite excited to check out the product!


Anonymous said...

I'll get you for that... :-)

Ahmed Elmagarmid said...

This is an excellent idea. We have tried to do something similar for record linkage on a much smaller and less ambitious scale. Mohammed Elfeky, now at Google, developed a toolbox named Taylor which has been useful for many academicians even after many years have passed. It allows one to test their RL algorithms using some available data sets along with some other features.

NSF, at both the OCI and CISE divisions, has been encouraging this type of work. A group of us at Purdue got funded a few years ago to do something like this for DB systems research.

Using the open source philosophy makes so much sense and will indeed spur many contributions.

Please keep us informed of developments through the blog.

Anonymous said...

I noticed that the OpenII Google Code page now has some actual code and commits -- is there a blog or wiki now that has more info on what's currently in the code base?


pmork said...

Jeff, the OpenII code is available for download at This project is not affiliated with The easy to remember URL for OpenII is

Anonymous, the following components are currently available:
* SchemaStore is the basic repository for schemata, mappings, derivations and groups.
* Harmony is a schema matching tool and corresponding UI.
* RMap is a code generation tool that converts schema mappings into SQL code.
* Galaxy is a repository browser that displays the derivation relationships among schemata.
* OpenII includes schema importers and exporters for several common formalisms, as well as a few ad hoc importers.

The current list of components is also available at

Please let us know if you have contributions that could be incorporated into OpenII!