Monday, April 14, 2008

Crawling the Deep Web

Our work on crawling the Deep Web has received some attention over the last few days. It started with a post on Google's Webmaster blog. Judging by the number of in-links to that post (see the bottom of the page) and the several news articles that picked it up, there were quite a few reactions in the blogosphere and beyond.

Matt Cutts, Google's main interface to webmasters, gives a nice explanation of why this work is useful to site owners. Anand Rajaraman details some of the history behind the technology that led to this work.

In summary, a nice example of data-management research having an impact on the Web.
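For readers wondering what "crawling the Deep Web" means in practice: the idea is to surface content hidden behind HTML forms by generating plausible form submissions and letting the crawler index the result pages. Below is a minimal, hypothetical sketch of that surfacing step in Python. The form URL and input values are made up for illustration; a real system chooses input values far more carefully than this exhaustive enumeration.

    # A minimal sketch of "surfacing" content behind an HTML form:
    # enumerate candidate values for the form's inputs, build the GET
    # URLs a submission would produce, and hand them to the crawler.
    # The endpoint and field values here are hypothetical.
    from itertools import product
    from urllib.parse import urlencode

    form_action = "http://example.com/search"    # hypothetical form endpoint
    select_values = {
        "state": ["CA", "NY", "WA"],             # options scraped from <select> menus
        "category": ["restaurants", "hotels"],
    }

    def surface_urls(action, fields):
        """Yield one crawlable GET URL per combination of form inputs."""
        names = list(fields)
        for combo in product(*(fields[n] for n in names)):
            yield action + "?" + urlencode(dict(zip(names, combo)))

    for url in surface_urls(form_action, select_values):
        print(url)    # in a real crawler, these would be fetched and indexed

Each generated URL looks like an ordinary web page to the crawler, which is what makes the previously hidden content indexable.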

1 comment:

Anonymous said...

A new web-community-based method for extracting and identifying seed sets for crawling has been developed by Daneshpajouh et al. If you are interested, you can find more info in their paper:
"A Fast Community Based Algorithm for Generating Crawler Seeds Set," WEBIST 2008, Funchal, Portugal, May 2008.