Monday, June 29, 2009

Fusion Tables: The third piece of the puzzle

When I joined Google in 2005, the goal of my group was to explore the different aspects of structured data and the Web. The first and most burning need was to address the deep web, the collection of databases stored behind forms and invisible to search engines. We developed a completely automated system that has crawled millions of forms in over 50 languages and hundreds of domains. The system surfaces pages from the deep web by guessing good queries that can be posed on the forms, and inserting the resulting HTML pages into the Google index. These pages are shown in the top-10 results for over 1000 queries per second. For all the details, see the VLDB 2008 paper by Madhavan et al.
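To give a feel for what "guessing queries that can be posed on the forms" means mechanically, here is a toy sketch in Python. It is not the system described in the paper: the function name `candidate_urls`, the form fields, and the example.com address are all made up for illustration, and it assumes a simple GET form where every combination of candidate values is worth trying.

```python
from itertools import product
from urllib.parse import urlencode


def candidate_urls(action_url, candidate_values):
    """Build one GET URL per combination of candidate input values.

    candidate_values maps a (hypothetical) form input name to a list of
    values we guess might return useful results.
    """
    names = sorted(candidate_values)  # fixed order so output is deterministic
    urls = []
    for combo in product(*(candidate_values[name] for name in names)):
        query = urlencode(list(zip(names, combo)))
        urls.append(f"{action_url}?{query}")
    return urls


# Hypothetical form with two inputs; the values are illustrative guesses.
urls = candidate_urls(
    "http://example.com/search",
    {"make": ["ford", "honda"], "zip": ["94043"]},
)
for url in urls:
    print(url)
```

A real crawler would also have to decide which inputs and values are worth trying at all, since enumerating every combination quickly becomes impractical; that selection is the hard part and is not attempted here.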

In a second project, we explored the collection of tables that are already on the surface web. We found over 150 million high-quality tables and developed a search engine for tables (see the VLDB 2008 paper by Cafarella et al. for the details). We also showed how to leverage 2.5 million table schemas that were part of this collection. This collection is now available to the research community.

On June 9th, we launched Fusion Tables, which represents the third piece of the puzzle of structured data and the Web. The main goal of Fusion Tables is to make it easier for people to create, manage and share structured data on the Web. Fusion Tables is a new kind of data management system that focuses on features that enable collaboration. We started with a relatively small set of features, but we’re rapidly expanding them, keeping our users’ requests as our top priority.

You can read the official announcement of Fusion Tables and a great example of how it is used for data collection in the domain of water. In a nutshell, Fusion Tables enables you to upload tabular data (up to 100MB per table) from spreadsheets and CSV files. You can filter and aggregate the data and visualize it in several ways, such as maps and timelines. The system will try to recognize columns that represent geographical locations and suggest appropriate visualizations.
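For readers who think in code, filtering and aggregating amount to something like the following Python sketch. It only illustrates the two operations on rows held in memory; it is not how Fusion Tables implements them, and the helper names, column names, and numbers are invented for the example.

```python
from collections import defaultdict


def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def aggregate_sum(rows, group_col, value_col):
    """Sum value_col within each distinct value of group_col."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_col]] += row[value_col]
    return dict(totals)


# Made-up rows standing in for an uploaded CSV file.
rows = [
    {"region": "North", "year": 2008, "liters": 10.0},
    {"region": "North", "year": 2009, "liters": 5.0},
    {"region": "South", "year": 2009, "liters": 7.0},
]

recent = filter_rows(rows, lambda row: row["year"] == 2009)
totals = aggregate_sum(recent, "region", "liters")
print(totals)
```

The filter runs first so that the aggregate only sees the rows of interest, which mirrors the order in which you would apply the two steps in the interface.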

To collaborate, you can share a table with a select set of collaborators or make it public. One of the reasons to collaborate is to enable fusing data from multiple tables, which is a simple yet powerful form of data integration. If you have a table about water resources in the countries of the world, and I have data about the incidence of malaria in various countries, we can fuse our data on the country column, and see our data side by side. Importantly, we can do this while maintaining complete control of our own data.
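In database terms, fusing on the country column is a join on a shared key. The sketch below shows the idea in plain Python under some assumptions: the function `fuse` is a hypothetical helper, not a Fusion Tables API, it behaves like a left join, and the country names and numbers are placeholders rather than real statistics.

```python
def fuse(left, right, key):
    """Left-join two lists of dict rows on a shared key column.

    Neither input list is modified; the result is a new list of new dicts,
    so each owner's table stays exactly as it was.
    """
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    fused = []
    for row in left:
        # Rows with no match on the right are kept with only their own columns.
        for match in index.get(row[key], [{}]):
            merged = dict(row)
            for col, val in match.items():
                if col != key:
                    merged[col] = val
            fused.append(merged)
    return fused


# Placeholder data: your table and mine, each owned separately.
water = [
    {"country": "Country A", "water_index": 1},
    {"country": "Country B", "water_index": 2},
]
malaria = [
    {"country": "Country A", "malaria_index": 10},
    {"country": "Country C", "malaria_index": 30},
]

side_by_side = fuse(water, malaria, "country")
for row in side_by_side:
    print(row)
```

Because the fused rows are built fresh, either of us can keep editing our own table without touching the other's, which is the point of the "complete control" remark above.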

Collaboration is not only about integration. Once the data is visible side by side, we may want to discuss it to understand it better or resolve conflicts. With Fusion Tables you can discuss data at multiple levels of granularity: rows, columns and individual cells. Hence, the data and the discussions are deeply integrated (or should I say, fused?).

Given our focus on collaboration, there are a lot of things we do not do (and we're pretty honest about it!). We do not support complex SQL queries or high throughput transactions. Despite our love for query optimization, we’ve implemented very little of it in the current system. We will, of course, add to these capabilities with time, but our real goal here is to explore data management for a broader audience of users and needs.

Please try it out and send us feedback! Our top priority now is to respond to our users' needs.

5 comments:

Eran said...

First of all, I think that it's an important product. Our team uses Google spreadsheets increasingly for data sharing, but naturally it does not scale. There are several general data management features which we will find useful, including SQL support, foreign keys, richer visualizations, and smarter merging capabilities. For data processing, supporting a general language such as R would be nice.

However, it seems that the collaboration features are the most interesting and promising. I would suggest a versioning mechanism, allowing users to create annotated versions of their tables. Also, comments and changes may not be related to single cells, but can be related to a set or a range of cells.

Eran Toch.

Kingsley Uyi Idehen said...

Alon,

It would be nice if the underlying GUIDs for the Table Records are exposed as HTTP URIs. By doing this you create a powerful mechanism for meshing/fusing disparate data sources on the Web. Basically, you play well with the Linked Data meme.

Also, you end up with a more powerful Virtualization mechanism for Web Addressable Data.

You call it "Dataspaces" and I call the same thing "Data Spaces" (nee. Virtuoso Database Technology) :-)


Kingsey

Kingsley Idehen said...

Two little typo corrections.

I meant: Virtual Database Technology re. what you call DataSpaces.

Kingsley (the 2nd fix re. my own name typo) .

Marshall said...

Alon,

Congrats on Fusion Tables - As a teacher, I love visualizing data and being able to show such to my students. I plan to try it out really soon.

May I veer slightly off topic? Would you know the engineer who could help me with my latest spreadsheet fantasy request?

http://www.google.com/support/forum/p/Google+Docs/thread?tid=14d3480d5b0a6769&hl=en

Thanks,

marshallforrester@gmail.com

DanYamins said...

Hi -- so your post said:

"We also showed how to leverage 2.5 million table schemas that were part of this collection. This collection is now available to the research community."

just wondering ... where are these tables available?

Thanks,
Dan