Thursday, July 31, 2008

OpenII: Open Source Information Integration Suite

Last week I hosted the OpenII kickoff workshop at Google. We had representatives from several companies: IBM, Microsoft, Yahoo, MITRE, Google, one guy who was supposed to represent Oracle but decided to be a professor again, and a couple of professors.

The goal of OpenII, as the name implies, is to create an open-source set of tools for information integration. The tool set will include, among others, wrappers for common data sources, tools for creating matches and mappings between disparate schemas, a tool for searching a collection of schemas, and run-time tools for processing queries over heterogeneous data sets.

The main goal of the effort is to foster innovation in the field of information integration and create tools that are usable for a wide range of applications.

In research, we often innovate on a specific aspect of information integration, but then spend much of our time building (and rebuilding) other components that we need in order to validate our contributions. Having a set of open-source tools will enable us to focus on our innovations and perform more meaningful comparisons between our methods.

On the applications side, information integration comes in many flavors, and therefore it is hard for commercial products to serve all the needs. Our goal is to create tools that can be applied in a variety of architectural contexts (e.g., materializing all the data in one repository vs. leaving the data in the sources and accessing it only at query time). In addition, many of the tools (e.g., schema matchers or dedup engines) often need to be extended for the particular domain at hand to fully leverage domain knowledge. Open-source tools allow application developers to do exactly that.
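To make the idea of extending a matcher concrete, here is a minimal sketch in Python. It is not OpenII code: the function names, the `synonyms` parameter, and the 0.6 threshold are all made up for illustration. It shows a generic name-based schema matcher with a single hook where a developer can plug in domain knowledge.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Generic string similarity between two attribute names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_schemas(source, target, synonyms=None, threshold=0.6):
    """Map each source attribute to its best-scoring target attribute.

    `synonyms` is the (hypothetical) domain-specific extension point: a set of
    frozensets, each holding two lowercase attribute names that a domain
    expert declares equivalent. The threshold is an arbitrary illustrative value.
    """
    synonyms = synonyms or set()
    mapping = {}
    for s in source:
        best, best_score = None, 0.0
        for t in target:
            if frozenset((s.lower(), t.lower())) in synonyms:
                score = 1.0  # domain knowledge overrides the generic score
            else:
                score = name_similarity(s, t)
            if score > best_score:
                best, best_score = t, score
        # Keep only matches that clear the threshold.
        if best is not None and best_score >= threshold:
            mapping[s] = best
    return mapping


source = ["patient_name", "dob"]
target = ["PatientName", "birth_date"]

# The generic matcher pairs up the similar names but misses the abbreviation.
generic = match_schemas(source, target)

# Adding one piece of domain knowledge recovers the missing match.
extended = match_schemas(
    source, target, synonyms={frozenset(("dob", "birth_date"))}
)
```

With only string similarity, `dob` is left unmatched because it shares almost no characters with `birth_date`; once the synonym pair is supplied, both attributes are mapped. A closed-source matcher would have to anticipate this kind of hook in advance, whereas with open-source code a developer can add it wherever the domain requires.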

You'll be hearing more about this project as we make progress. If you would like to contribute to it, please contact me!

Brain Rules and Computer Science Education

I just finished reading Brain Rules by John Medina. The last page of the book inspired this post, but let me first tell you what the book is about.

The book examines different aspects of our brain and derives from them some principles and suggestions on how to adopt better behaviors. None of these suggested behaviors are surprising at all, but Medina (very entertainingly) explains why they are good for us given the brain's structure and describes some research that validates these claims. For example, chapter 1 tells you that doing aerobic exercise actually improves your mental capacities. Another chapter talks about the need for sleep and another about why stress is bad. Two chapters talk about memory (short term and long term), explaining why repetition of new knowledge can greatly improve its recall. Another chapter explains (finally!) that men and women are different (women are apparently much more complex than men, in case you need another shocker), and another stresses that we never stop learning in life (or at least, we have the capacity to).

At the end of every chapter, Medina asks how we can apply these nuggets of knowledge to develop new methods for education. In the last chapter, he explains why the unique aspects of medical schools make for well-trained doctors as well as curious researchers. The point is that in medical school, students are learning the theory of medicine at the same time they are practicing it (in increasing doses as they advance in the program). Hence, they are able to apply their knowledge immediately and, after observing patients, ask novel questions that lead to new research and discoveries. Medina suggests that the same principle can possibly be applied in other disciplines.

I think Computer Science education can really benefit from such a model. I obviously don't have all the details worked out here, but imagine that every Computer Science department (or set of departments) had a software company on the side. As students go through the program, they start getting tasks from that company to build software for it, participate in designs, see how product decisions affect engineering processes, and even see some company politics at work. These companies will be real (they'll need to pay for these services) -- they'll generate real software for real customers.

I even have an initial idea of what these companies can do. Given that these companies are likely to have challenges competing in the market, they need to address a niche of customers who are willing to put up with lousy service, mediocre products and delays in software release cycles. I.e., customers who have nowhere else to go!

These customers are called scientists. Scientists are always complaining that they don't have the right software tools to do their science. Real companies typically don't find scientists to be an appealing set of customers because, well, they don't really want to pay up and they often have very specialized needs. There are huge challenges in creating good software tools for scientists, and university-affiliated companies could be an excellent place to develop this software, while preparing the next generation of computer scientists for the real world.

Tuesday, July 15, 2008

FOO Camp 2008

I spent the weekend at FOO Camp 2008, an annual event organized by publisher O'Reilly Media (hence the name, Friends Of O'Reilly). The event brought together 275 movers and shakers of the tech industry and related industries, and was an incredible experience. It was as if someone injected into my brain the latest and greatest ideas and thoughts with one joyful syringe, accompanied by a few good glasses of wine. Michael Arrington of TechCrunch captures the spirit of FOO Camp in his blog post (and you can even see me standing and looking busy behind Jimmy Wales, the founder of Wikipedia, in one of his photos).

The conference begins with no set agenda. They put up an empty board with the different time slots and locations of sessions, and as the participants arrive, they fill up the board with sessions. There are about 10 sessions going on in parallel at any given time, most of them looking quite fascinating.

To give you a rough idea, within the span of a few hours, I attended sessions on:

-- aggregating meta-data on the web (i.e., all the data we create as we use services on the web), organized by Esther Dyson,
-- the future (or lack thereof) of journalism (organized by several NY Times and SeattlePI reporters),
-- "open education" (tools, policies and politics of),
-- crowd-sourcing vs. curation (i.e., how to balance all the inputs one gets from the bloggers of the world with careful aggregation and analysis of information),
-- how computers can help the humanities (e.g., analyzing the Bible, helping archaeologists), organized by Martin Wattenberg, the creator of Many Eyes,
-- educational tools for virtual worlds, and
-- a very well attended session on small things one can do to become happier in life.

There was also a session on "big data", organized by Roger Magoulas, the director of research at O'Reilly. The point I took away from that session is that owners of big data sets are now more confused than ever. They face a much wider array of architectural choices for data management systems than they ever did. These include map-reduce based systems, column stores, real-time warehouses, streaming systems, and various systems built on top of MySQL. Each of these architectures has its advantages and limitations, but it's becoming increasingly hard for application builders to understand the tradeoffs (and it's not like marketing departments are getting rewarded for making the choices clearer). It's no longer the world where you buy your favorite relational database system and you're done (and stuck). I think this situation presents some interesting research challenges for the database community (it's also interesting how some of these architectures get little attention in the community).

The idea of designing the conference program on the spot is very appealing, and I'd like to propose we do a little bit of it in traditional scientific conferences. (There is a concept of birds-of-a-feather sessions, but that's usually a grab bag of ideas.) We should allot time slots in our conferences where sessions can be organized as the participants arrive, and stimulate discussions there. That's a much better way of getting up to speed on hot topics and people's current thinking, which is what conferences should be for!