Monday, April 14, 2008

Crawling the Deep Web

Our work on crawling the Deep Web has received some attention over the last few days. It started with a post on Google's Webmaster blog. Judging by the number of in-links to the post (see the bottom of the page) and the several news articles that picked it up, there were quite a few reactions in the blogosphere and beyond.

Matt Cutts, Google's main interface to webmasters, gives a nice explanation of why this work is useful to site owners. Anand Rajaraman details some of the history behind the technology that led to this work.

In summary, a nice example of research on data management having impact on the Web.

Wednesday, April 2, 2008

Bar-coding in Costa Rica

I spent last weekend in the Area de Conservacion Guanacaste (ACG) in Costa Rica with a few of my colleagues. We were hosted by Dan Janzen and Winnie Hallwachs (his wife), an incredibly inspiring pair of biologists. Among many other awards, Dan received the Kyoto Prize in 1997. For the past 30 years, Dan and Winnie have spent half of every year in Costa Rica, creating the ACG, while spending the other half professing at the University of Pennsylvania. I'll have to skip the details of how we got there, but do ask me in person when you see me (and if you need to jog my memory, use the phrase "party in the sky").

You can find the pictures from the trip here.

So you're probably wondering what a guy like me, with questionable credentials in Biology, was doing in such a biologically intense area.

Imagine that every living species, animal or plant, had a barcode, just like products in a supermarket. Furthermore, imagine that you had a device, the size of a cell phone, such that when you found a specimen in the forest, you could put the specimen into the device and it would tell you all the known information about it. In addition to being useful on hikes, such a device could have a major impact on agriculture and on controlling the spread of disease.

The International Bar-Code of Life Project (iBOL) is trying to do exactly that, based on genomic techniques. Specifically, it turns out that the CO1 gene determines the species with over 98% accuracy. In contrast, the traditional approach to determining species is based on morphological features. By sequencing the CO1 gene, Janzen and many others have been able to resolve several mysteries, showing that specimens that look very similar actually belong to different species, and vice versa. Janzen runs the biggest specimen collection operation (Costa Rica happens to have a huge number of different species, hence Janzen's conservation goal). Currently he sends the specimens to the University of Guelph in Canada for sequencing (in a lab run by Paul Hebert, who was also there), but they envision that within a decade we'll be able to build the small device.
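For the programmers in the audience, here is a toy sketch of what "looking up a barcode" could mean: compare a sample's CO1 fragment against a reference library and report the closest species if it is similar enough. Everything in it is my own assumption for illustration, not how iBOL or the Guelph lab actually works: the sequences are invented and far shorter than a real barcode, the names `identity` and `identify` are hypothetical, the sequences are assumed to be already aligned to the same length, and the 0.9 threshold is arbitrary.

```python
from typing import Dict, Optional

# Toy reference library: species label -> invented "CO1" fragment.
# These strings are made up and far shorter than a real barcode.
REFERENCE_LIBRARY: Dict[str, str] = {
    "species_a": "ATGCGTACGTTAGC",
    "species_b": "ATGCGTTCGATAGC",
    "species_c": "TTGCAAACGTTAGG",
}


def identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def identify(
    sample: str,
    library: Dict[str, str],
    min_identity: float = 0.9,  # arbitrary cutoff, chosen for the example
) -> Optional[str]:
    """Return the best-matching species label, or None if nothing is close."""
    best_name: Optional[str] = None
    best_score = 0.0
    for name, reference in library.items():
        score = identity(sample, reference)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_identity else None


if __name__ == "__main__":
    # One mismatch relative to species_a: still identified.
    print(identify("ATGCGTACGTTAGG", REFERENCE_LIBRARY))
    # Nothing in the library is close: reported as unknown (None).
    print(identify("GGGGGGGGGGGGGG", REFERENCE_LIBRARY))
```

Returning `None` for a poor match is deliberate: a specimen that resembles nothing in the library is exactly the interesting case, since it may be a species nobody has barcoded yet.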

We spent the weekend in numerous intense discussions on Biology, walking through the forest to see it firsthand, and actually participating in the process of collecting specimens and preparing them to be sent for sequencing.

In the discussions we tried to understand the challenges involved in this project (including arguments by its critics). It actually turns out that determining species can often be very subjective, for two reasons. First, the determination typically needs to be done with only partial information about the set of specimens available, and unless you can find other evidence, morphology is typically the deciding factor. Second, and somewhat more surprising to me, not all biologists completely agree on what the concept of species even means. The most accepted definition is based on the ability to mate and create viable offspring, but there are other opinions as well (e.g., it's the morphology, stupid). In fact, it's not even clear (to me, at least) that classification into species is as important as it's traditionally been considered, since many of the questions we're asking about animals or plants depend on other genetic and environmental traits.

And yes, there are huge data management challenges here. Many scientists are collecting data, each putting it into their own format. They would like to share their data but also maintain control over it. They'd like to publish the data on the web and make it accessible to the masses. They need to manage uncertainty and provenance. Ironically, one of the closest systems I know of that considers some of these issues is Orchestra, built by Zack Ives at the... University of Pennsylvania (i.e., a few buildings away from Janzen's office).

Then there was the flight back, but I can't talk about that either. Overall, an incredible experience! Many thanks to Dan and Winnie (and their crew) for hosting us and sharing their incredible knowledge and passion!