Monday, June 25, 2007


I just returned from SIGMOD 2007 in Beijing. This was only the second time SIGMOD has been held outside North America, and it was quite a success. The organization, led by Profs. Zhou Lizhu and Tok Wang Ling, and the technical program, chaired by Beng-Chin Ooi from the National University of Singapore, were excellent. It’s hard to beat the Summer Palace as a location for a conference banquet.

I wish I could give a technical summary of all the innovations reported at the conference. Instead, I’ll comment about a conclusion I drew from the three keynotes and give a quick plug for my own paper (this is my blog, after all).

SIGMOD featured 3 excellent keynotes from esteemed members of our community. Phil Bernstein from Microsoft Research talked about the progress made on Model Management, a research agenda he started 7 years ago, and on the challenges ahead. H.V. Jagadish from the University of Michigan talked about how to make databases usable, identifying some of the pain points and research challenges. Finally, Gerhard Weikum from the Max Planck Institute in Saarbruecken, Germany, talked about research combining databases and information retrieval. Each of them wrote a paper for the proceedings that is well worth reading.

I found a couple of common themes among the keynotes. First, they are all pushing the community in important and non-traditional directions. I found that extremely heartening. Second, I think the three keynotes, each from a different angle, support the claim that we need a much better understanding of users and their pain points when they work with structured data. That's a very touchy subject for database folks, who are used to spending their time 'under the hood'.

In Jagadish’s case, usability was the subject of the talk, and hence understanding users is crucial. In Phil’s case, he gave the example of generating schema mappings (mappings between disparate databases), and he was trying to get at what the pain points may be there (he argued that in the contexts he’s been considering, producing schema matches, typically the first step in mapping generation, is no longer the bottleneck). In Gerhard’s case, the question that comes up is what the right answer is when we combine DB&IR in a single system, i.e., what the real user need is. In a DB system, the semantics of the query clearly dictate the answer, but when you combine structured and unstructured data, it's no longer clear what the ranking criteria should be.

The point I’m trying to make is the following. As a community, we need to study user needs as they work with structured data, whether they are creating data, trying to understand existing data, formulating queries or creating mappings. Importantly, we need to keep in mind that a user’s task is rarely just to get an answer from a structured database. Users are typically working with both structured and unstructured data, and their tasks are broader than a single query. A useful interaction with a system is one that brings them closer to completing their task (I know, this is fuzzy, but that's why it's research).

It is tempting to push these problems to the HCI community, but I would argue this is a mistake. These problems will not be high enough on the agenda of the HCI community (there, if your device doesn’t move or perform magic, it’s uninteresting), whereas for us they are crucial for identifying good research directions and evaluating them. As a community, we need to find a way to encourage research on usability and to learn from the HCI community how to evaluate such research. We need to bring this agenda squarely into our conferences.

I'm not the only one to touch on this topic, and we're not the only community to see this need. A recent report titled “The Landscape of Parallel Computing Research: A View from Berkeley” argues a similar point about developing novel programming models. I think visualization is an important component of this research agenda (see Anant Jhingran’s blog post about this very point), and see Laura Haas’ excellent recent ICDE keynote and paper for a very nice articulation of this argument in the context of data integration.

My student Luna Dong presented our paper on indexing dataspaces. This paper has the distinction of being the first technical paper I published with dataspaces in its title. The paper describes a set of indexing methods that enable efficient querying of a collection of loosely coupled data sources (i.e., we do not have semantic mappings between them). Because of the nature of dataspaces, the queries we support enable users to specify structure when they know it, and keywords to complement the structure (we call them predicate queries and association queries). The basic idea underlying the solution is to extend the technique of inverted lists from IR to incorporate information about the structure of the data. Importantly, the technique also incorporates hierarchies in the data, and therefore it enables uniform querying of data sets that have different underlying structures. Luna performed a series of experiments showing the benefits of our approach and comparing it to techniques for indexing XML, which are the closest contenders for addressing these problems.
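
To give a flavor of the basic idea, here is a minimal sketch in Python. It is an illustration under my own simplifying assumptions, not the index described in the paper: the class name, the (keyword, attribute) key layout, and the toy attribute hierarchy are all hypothetical. Each keyword is posted once as a plain keyword and once per attribute it appears under, including that attribute's ancestors in the hierarchy.

```python
from collections import defaultdict


class StructureAwareIndex:
    """Toy inverted index whose keys pair a keyword with an attribute name.

    Hypothetical sketch: the key layout and hierarchy handling are
    illustrative assumptions, not the method from the paper.
    """

    def __init__(self, parent_attr=None):
        # Assumed attribute hierarchy, given as child -> parent.
        self.parent_attr = dict(parent_attr or {})
        # (keyword, attribute or None) -> set of item ids
        self.postings = defaultdict(set)

    def _with_ancestors(self, attr):
        """Return attr followed by its ancestors in the hierarchy."""
        chain, seen = [], set()
        while attr is not None and attr not in seen:
            chain.append(attr)
            seen.add(attr)
            attr = self.parent_attr.get(attr)
        return chain

    def add(self, item_id, attributes):
        """Index one item, given as a dict of attribute -> text value."""
        for attr, value in attributes.items():
            for token in value.lower().split():
                # Plain keyword entry, as in a classic inverted list.
                self.postings[(token, None)].add(item_id)
                # Structure-aware entries for the attribute and its ancestors.
                for a in self._with_ancestors(attr):
                    self.postings[(token, a)].add(item_id)

    def search(self, keyword, attr=None):
        """Items containing keyword, optionally restricted to an attribute."""
        return set(self.postings.get((keyword.lower(), attr), set()))


# Two sources with different schemas and no mappings between them.
index = StructureAwareIndex(parent_attr={"author": "person", "editor": "person"})
index.add("paper1", {"title": "Indexing Dataspaces", "author": "Luna Dong"})
index.add("book1", {"name": "Data Integration", "editor": "Luna Smith"})

keyword_only = index.search("luna")              # no structure known
by_author = index.search("luna", "author")       # structure specified
by_person = index.search("luna", "person")       # via the hierarchy
```

In this toy setup, a user who knows the structure can ask for "luna" under author, while one who does not can fall back to the bare keyword or to a more general attribute such as person, and the same lookup works across both sources.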


Valentin said...

Enjoy your blog - but one technical comment: your use of custom markup like: <st1:place st="on"><st1:city st="on">Saarbruecken</st1:city> results in your post being unreadable when read through Bloglines (and potentially other RSS readers as well). Bloglines simply does not display the text enclosed in these tags - resulting in for example: "It’s hard to beat the as a location for a conference banquet. "
