When I joined Google in 2005, the goal of my group was to explore the different aspects of structured data and the Web. The first and most burning need was to address the deep web, the collection of databases stored behind forms and invisible to search engines. We developed a completely automated system that has crawled millions of forms in over 50 languages and hundreds of domains. The system surfaces pages from the deep web by guessing good queries that can be posed on the forms, and inserting the resulting HTML pages into the Google index. These pages are shown in the top-10 results for over 1000 queries per second. For all the details, see the VLDB 2008 paper by Madhavan et al.
In a second project, we explored the collection of tables that are already on the surface web. We found over 150 million high-quality tables and developed a search engine for tables (see the VLDB 2008 paper by Cafarella et al. for the details). We also showed how to leverage 2.5 million table schemas that were part of this collection. This collection is now available to the research community.
On June 9th, we launched Fusion Tables, which represents the third piece of the puzzle of structured data and the Web. The main goal of Fusion Tables is to make it easier for people to create, manage and share structured data on the Web. Fusion Tables is a new kind of data management system that focuses on features that enable collaboration. We started with a relatively small set of features, but we’re rapidly expanding them, keeping our users’ requests as our top priority.
You can read the official announcement of Fusion Tables, as well as a great example of how it is used for data collection in the domain of water. In a nutshell, Fusion Tables enables you to upload tabular data (up to 100MB per table) from spreadsheets and CSV files. You can filter and aggregate the data and visualize it in several ways, such as maps and timelines. The system will try to recognize columns that represent geographical locations and suggest appropriate visualizations.
To collaborate, you can share a table with a select set of collaborators or make it public. One of the reasons to collaborate is to enable fusing data from multiple tables, which is a simple yet powerful form of data integration. If you have a table about water resources in the countries of the world, and I have data about the incidence of malaria in various countries, we can fuse our data on the country column, and see our data side by side. Importantly, we can do this while maintaining complete control of our own data.
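To make the idea of fusing concrete, here is a minimal sketch of a join on a shared country column. It is not how Fusion Tables implements it; the `fuse` function, the column names, and the placeholder values are all made up for illustration, and the tables are just lists of dictionaries.

```python
def fuse(left, right, key):
    """Join two tables (lists of dicts) on a shared key column.

    Hypothetical helper for illustration only. Rows are kept only when
    the key value appears in both tables. Neither input is modified.
    """
    # Index the right-hand table by its key for constant-time lookups.
    right_by_key = {row[key]: row for row in right}

    fused = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is None:
            continue  # no counterpart in the other table
        merged = dict(row)  # copy, so the owner's row stays untouched
        # Add the other table's columns next to ours (the key is shared).
        merged.update({k: v for k, v in match.items() if k != key})
        fused.append(merged)
    return fused


# Placeholder tables; the values are stand-ins, not real statistics.
water = [
    {"country": "Country A", "water_resources": "w_a"},
    {"country": "Country B", "water_resources": "w_b"},
    {"country": "Country C", "water_resources": "w_c"},
]
malaria = [
    {"country": "Country B", "malaria_incidence": "m_b"},
    {"country": "Country C", "malaria_incidence": "m_c"},
    {"country": "Country D", "malaria_incidence": "m_d"},
]

side_by_side = fuse(water, malaria, "country")
for row in side_by_side:
    print(row)
```

Only the countries present in both tables appear in the result, and because each fused row is a copy, your table and mine are left exactly as they were.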
Collaboration is not only about integration. Once the data is visible side by side, we may want to discuss it to understand it better or resolve conflicts. With Fusion Tables you can discuss data at multiple levels of granularity: rows, columns and individual cells. Hence, the data and the discussions are deeply integrated (or should I say, fused?).
Given our focus on collaboration, there are a lot of things we do not do (and we’re pretty honest about it!). We do not support complex SQL queries or high-throughput transactions. Despite our love for query optimization, we’ve implemented very little of it in the current system. We will, of course, add to these capabilities with time, but our real goal here is to explore data management for a broader audience of users and needs.
Please try it out and send us feedback! Our top priority now is to respond to our users' needs.