My Dad is 80!

My dad turned 80 this month, and we celebrated the event with a workshop and reception in his honor at the Weizmann Institute of Science, in Rehovot, Israel. The full set of pictures from the event can be seen here.

My dad is a professor of Chemistry at the Weizmann Institute. After fighting in the Israeli War of Independence, he was finally able to focus on his studies. He completed his Ph.D in a little less than 2 years(!!) in 1955 at Syracuse University in New York (fortunately, because that's where he met my mom). When he's asked how he did that, he simply shows the picture below.

He still doesn't understand why it took me an entire 5 years to do my Ph.D, and worse, in a field that uses the term 'science' in a questionable fashion. (When I got promoted to full professor he finally figured I might be doing something right).

My dad's main claim to fame is a 1-page article he wrote during his post-doc. The following makes the point better than I can - it's a quote from Krzysztof Matyjaszewski (a CMU professor) and Axel Muller (professor at the U. of Bayreuth, Germany) in their foreword to the December 2006 special issue of the Journal of Progress in Polymer Science on "50 Years of Living Polymerization":

On June 5, 1956, Michael Szwarc, together with Moshe Levy and Ralph Milkovich published an article entitled "Polymerization initiated by electron transfer to monomer - A new method for formation of block copolymers", J Am Chem Soc (1956), 2656.

In this article the term "living polymer" appeared for the first time. It caused a revolution in polymer science.

Michael Szwarc (who was also my dad's Ph.D adviser) received the Kyoto Prize for this work in 1991.

He has worked in many areas over the years, but since the mid-80's my dad has been one of the pioneers in solar energy research, studying methods for chemical storage of solar energy so it can be used any time and transported to less sunny locations. In fact, he published two papers on using solar energy for chemical transformations this year! As a befitting token of recognition, he received an awesome Google solar t-shirt...

And he definitely needs the t-shirt. He still gets up every morning at 6am to either play tennis, or go for a run & workout, which includes running up 15 flights of stairs in the solar tower at the institute!

It was a great event, and a wonderful chance to see many of my dad's colleagues throughout his career, some of whom I had not seen since I was a kid. It was also the first full gathering of all the family's grandchildren.

Dataspaces for Veterans

In the U.S. we are marking Veterans Day tomorrow. There are many ways in which we should be thanking our veterans and making their lives better. I'd like to report a rather unique way.

I recently had the opportunity to visit the Veterans Administration Hospital in Washington DC and learn first-hand about their patient-record system. I was pleased to see the principles of dataspaces in action, clearly enabling better healthcare services.

The VA provides services to veterans of the American military and has around 150 hospitals, 800 clinics and 200 nursing centers scattered around the country. To support these services, the VA maintains electronic records for all their patients, a system that has won them many accolades in recent years. The system stores the patients' prescriptions, doctor visits, lab tests and other data about each patient. As their patients often move around and receive treatment in various locations, when a doctor views the data about a patient, it needs to be integrated from multiple VA locations. Each of these locations is running their own system. In addition, data about their patients may reside in systems of the Department of Defense (and their healthcare providers) and various drugstore chains.

Clearly, this is an incredible data integration problem. Today they are aware of at least 130 different "implementations" of their electronic record system, i.e., different schemas. Also, given the different local needs of hospitals and clinics, imposing a single schema on all the VA centers would not work. Using a data integration solution at this scale and in such a dynamic environment would be extremely difficult.

Instead, what the VA did is standardize on a very small subset of patients' attributes, namely attributes describing patients' vital signs. Outside of this set of attributes, hospitals are free to develop their own local data organizations. However, the system lets the healthcare providers see all the data even if it's not completely integrated. So for example, if a doctor wants to see what happened to a patient while they were at a remote location, then the remote data may appear as plain text, and therefore the doctor would have to work a little harder to digest it, and won't be able to pose the queries she could pose on the local data. But being able to see the data in some form is infinitely better than not seeing it at all, and the doctors are extremely happy with the system's capabilities.
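To make the idea concrete, here is a minimal sketch of that kind of loose integration. It is not the VA's system: the attribute names, site names and record contents are all made up for illustration. The only thing taken from the description above is the principle that a small agreed-upon set of attributes is aligned across sites, and everything else is still shown, just as plain text.

```python
from dataclasses import dataclass, field

# Hypothetical shared attributes. The description only says the agreed subset
# covers vital signs; these particular names are assumptions.
SHARED_ATTRIBUTES = {"heart_rate", "blood_pressure", "temperature"}


@dataclass
class PatientView:
    structured: dict = field(default_factory=dict)  # aligned across sites, queryable
    free_text: list = field(default_factory=list)   # visible, but not queryable


def integrate(records):
    """Merge per-site records: align shared attributes, show the rest as text."""
    view = PatientView()
    for site, record in records:
        for attr, value in record.items():
            if attr in SHARED_ATTRIBUTES:
                view.structured.setdefault(attr, []).append((site, value))
            else:
                view.free_text.append(f"[{site}] {attr}: {value}")
    return view


# Made-up records from two sites with different local schemas.
records = [
    ("site_a", {"heart_rate": 72, "rx_code": "A-17"}),
    ("site_b", {"heart_rate": 80, "local_note": "follow-up in 2 weeks"}),
]
view = integrate(records)
print(view.structured)
print(view.free_text)
```

The doctor can query `view.structured` across sites, while the entries in `view.free_text` have to be read by eye, which is the trade-off described above.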

The VA also demonstrated two examples of the pay-as-you-go principle that is at the foundation of dataspaces. The first was the fact that they decided that vital signs are critical, so their data sources are aligned on the attributes relating to those (effectively, creating semantic mappings involving the attributes of vital signs), and they plan to continue agreeing on terminology as they see fit. Second, they had a culture that allowed for local innovation, class-3 applications, that represented needs at the local level. When these needs were perceived to be important throughout the organization, they promoted them to class-1 applications, and required all their systems to support them.

Just to make it clear, when I walked in the door they did not greet me and say: "Pleased to see you Dr. Halevy; we'd love to show you our dataspace system". What I'm describing is a post-rationalization of a system that was developed over more than a decade. I believe that their loose integration was the key to their success.

A Murder Mystery with a Twist

I just finished reading dot.dead, a Silicon Valley murder mystery by Keith Raffel. Yes, a murder happening in Palo Alto at the home of Ian Michaels, a high-tech executive; searching for clues on Stanford campus and running the dish to think deep thoughts and unravel the mystery.

Will Ian Michaels be indicted and spend the rest of his life in jail? Or perhaps he will be promoted to that COO job he's been eying for a while, or even leave his company and start his own? And in the process, how many eligible (or non-eligible) women will try to seduce him?

Read the book and find out. Not bad for an author who used to be a high-tech guy himself.

A Trip to the Amazon

You can’t go to Brazil without going to the Amazon. So after giving an invited talk at SBBD 2007 in Joao Pessoa, I headed to Manaus, the capital of the Amazon region (see all photos here). It took 3 flights and some time-zone slalom (while I was flying west all the time, I had to turn my clock forward at one point, and backward at another). I landed in Manaus at 9pm, in the dark. A gentleman named Eduardo was waiting at the airport with my name on a sign held up high. We got into a van that started driving towards the docks. It seemed that we were going through some pretty heavy forest on country roads into the darkness of the night (though I was reading a murder mystery during the flight, I tried to keep a positive attitude here).

We arrived at the docks and I was handed off to an unnamed man with a motorboat. I was told that it would take 20 minutes to get to the EcoPark. As I was sitting there, sailing in pitch dark (with only the moon in a very southern-hemisphere position to provide a bit of light), the sensation of adventure started sinking in. The boater turned on his flashlight every now and then to see where we were, and surprisingly, 20 minutes later, we arrived at a lodge, and I met Antonio, who would be my guide for the stay. After a welcoming drink and a short hike in the forest to my cabin, I plugged in my cell phone and went to sleep.

In the morning, after a yummy Brazilian breakfast, I went on a forest tour with Antonio and John, an American fellow who is 10 years into his retirement (he’s 48 now) and whose travel plans for the next couple of months made even my head spin. In the forest, we got to see an original cinnamon tree and a tree from which they produce “aborto”, which, as the name implies, is used to abort pregnancies (and is also a useful post-hangover medication). I got to be Tarzan for a photo and see Antonio make gunpowder from scratch. Really. As we were finishing our forest walk it started raining (ah, get it? RAINforest, in this case pourForest would be more appropriate), and we were thankful it didn’t start earlier.

After lunch, I was introduced to cashew trees (yes, apparently they grow on trees, not at Costco). The barman took the cashew fruit and made a nice drink out of it. Later we went to an area with a bunch of monkeys playing about, including one with a red head and one that was called cappuccino monkey (if you saw it, you would understand why). We were taken to see a few folks from the local Indian tribes (and ended up in funny costumes, doing their dance), and then spent an hour fishing on the river. I even managed to catch a catfish that I threw back into the water once the photo-op was over. Sitting on the fishing boat was incredibly tranquil, with sounds of the toucans flying about (and the news of the latest Google earnings report coming in on my cell).

I had dinner with a retired Swedish couple in their late 50’s. The husband ran an international company providing interior design services for cruise ships. He admitted that when a good friend of his came to him many years ago with sketches describing his idea for an ice hotel, he told his friend that he was crazy. Fortunately, the friend ignored his advice and did it anyway. (Oriana and I got engaged in that ice hotel in 2000).

The night activity involved a canoe trip on the river, looking for caimans and listening to the night sounds. We saw only the eyes of the caimans from afar, caught a couple of turtles, and heard many frogs. When we returned to the lodge, Antonio thought he heard lightning, but immediately corrected himself – they were only fireworks. Why fireworks, I asked? Because there was a soccer game in Rio between two teams from Rio, and the people in Manaus were very happy with the result. And it wasn’t even a terribly important game. But that’s Brazil for you! Celebrating a soccer victory of a team from a city thousands of miles away in a relatively unimportant game is still good enough reason for fireworks.

The next morning, before going to the airport, we managed to squeeze in a speedboat trip to the meeting of the waters – the place where the Rio Negro and Rio Solimoes come together to create the Amazon River (that then flows to the Atlantic Ocean). It was really fascinating to see the two distinct waters (they’re different in color, temperature and pH).

Five hours later I was in Miami, and six hours after that in the Bay Area. What a transformation!

SBBD 2007

I just returned from a wonderful trip to Brazil, where I, along with another funny-looking person, gave keynotes at the Brazilian Database Conference, SBBD 2007. The conference is held jointly with the Brazilian Software Engineering Conference, and had over 600 attendees. You can see all my photos here.

The conference was held in Joao Pessoa (read: John person; apparently he was a poet), which is at the easternmost point in South (or North) America, the closest place to start a swim to Africa. The venue was a tropical beach resort. I'm not sure when the residents of Joao Pessoa sleep. They seemed to be dancing all night, and at 5am the road along the beach was closed till 8am so people could exercise peacefully. I guess they need all the dance and exercise to burn off the calories from their great food.

Many thanks to Altigran Silva for inviting me and being a wonderful host, to Juliana Freire for all her incredible help and company, and to all my Brazilian friends for their great welcome (yes, my Orkut network has grown exponentially as a result of the trip!)

A Geography Quiz

China and Brazil. Both very big countries. One of them has a single time zone throughout and the other currently has 7 time zones. Which is which?

You were wrong.

The answer is: Brazil 7, China 1. (Sort of what you'd expect if they played soccer against each other).

I was in China earlier this year and went as far west as Tibet (parts of which are about as far from Beijing/Shanghai as San Francisco is from New York). Still, a single timezone throughout.

In Brazil the situation is much more complicated, as the country is challenged both in longitude and latitude. Ordinarily, Brazil spans 4 time zones. However, they have just moved to daylight saving time.

If you live close to the equator, then the concept of seasons is rather abstract, so DST makes no sense and the northerners ignore it. As a result, the northern part of the country spans 4 time zones on standard time, and the southern part spans 3 time zones on daylight saving time (I realize this means there is an overlap between the timezones so if you're being strict there are only 4 or 5). But if you're flying around Brazil, as I'm doing now, there are 7 zones for you to grapple with.

I flew in to Rio on Saturday, a few hours before they moved to DST. On Sunday morning, having successfully woken myself an hour early, I flew to Joao Pessoa, where they're not on DST. Needless to say, things are messed up a bit, but in Brazil you just drink a little more and relax.
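For the strict readers, the "4 or 5" can be checked with a few lines of arithmetic. This is only a sketch: the UTC offsets below are illustrative assumptions, and in particular I'm not committing to which three zones the southern part covers, so the code tries both possibilities.

```python
def distinct_offsets(north_standard, south_standard, dst_shift=1):
    """Return the set of clock offsets in use when only the south observes DST."""
    south_dst = {offset + dst_shift for offset in south_standard}
    return set(north_standard) | south_dst


# Assumed standard-time offsets (hours from UTC) for the four northern zones.
NORTH = [-5, -4, -3, -2]

# Counting each (zone, DST-or-not) combination separately: 4 + 3.
labelled_zones = len(NORTH) + 3

# Two hypothetical choices for the three southern zones.
west_leaning = distinct_offsets(NORTH, [-5, -4, -3])
east_leaning = distinct_offsets(NORTH, [-4, -3, -2])

print(labelled_zones, len(west_leaning), len(east_leaning))
```

With the first choice every DST offset coincides with a northern standard offset, leaving 4 distinct clock times; with the second, one new offset appears, giving 5.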

The Oxford Murders

Want to read a murder mystery with a mathematical twist? "The Oxford Murders", by Guillermo Martinez is a nice one (apparently, the movie is coming out next year).

A series of murders is happening in Oxford, tangentially involving people in the math department there. Supposedly, there is a mathematical series underlying the murders, and a couple of mathematicians are trying to figure it out before the series goes too far. A very nice and rather quick read.

VLDB 2007 Trip

I just came back from VLDB 2007 in Vienna and from giving a 2-day data integration course at the University of Aalborg, Denmark.

Traveling in Europe is always fun. I find it much more relaxing to assume (with some loss of accuracy) that the Euro and the American dollar are about equal in value. It seems more affordable this way. I had never been to Vienna before -- a very nice place to visit. After spending 5 days there, I had 5 schnitzels (a favorite childhood dish of mine) and a large but finite number of Viennese cakes & tortes. Fortunately, I did have the opportunity to go for a couple of jogs while there, so I am still able to fit into my clothes.

In Aalborg I gave a course on data integration. This was the first time I gave lectures based on the first few chapters of the book I'm writing with Zack and AnHai. The energetic students in Aalborg helped me debug the slides and the presentation, and overall it was a great experience. Though I knew this before, the database group in Aalborg is a very strong one and doing some exciting work (I was initially surprised to see that all the rooms in the department were labeled DatalogI, but then I was told this means computer science in Danish). My host, Christian Jensen, was very kind, and after making sure I got my exercise, took me up the coast to some very charming towns.

VLDB was very interesting. Yet again, I was pleased to see a lot of work going on in the area of data integration, uncertain data, web, etc. I attended two excellent tutorials: Adaptive Query Processing by Zack Ives and Amol Deshpande, and on Probabilistic Graphical Models and their application to data management, by Sunita Sarawagi and ... Amol Deshpande.

The high point of the conference was no doubt the video shown during the 10-year best paper award talk by Surajit Chaudhuri and Vivek Narasayya. The extremely hilarious video showed Surajit giving a demo of Auto-Admin alongside Bill Gates during Gates' keynote at SIGMOD 1998. To put it nicely, the video showed Surajit's ability to mask certain unexpected mishaps during the demo and make it all appear to go extremely smoothly.

A few words on DBclips. Everyone I speak to says it's an excellent idea. However, so far, very few people created them for their papers. Since Luna posted the DBclip on our paper less than 2 weeks ago, it's already received over 180 views. Need I say more?

There are simply too many interesting talks and other events going on in parallel during most conferences. There is no way anyone can make it to all the talks of interest to them. Sadly, even people with the best intentions will not have time to diligently go through the proceedings and read all the papers either. A DBclip is an excellent way to reach a wider audience of people. While it takes some effort, it's well worth it. In fact, during my course in Aalborg I showed our DBclip instead of lecturing on data integration with uncertainty. I imagine that these clips will be a useful teaching resource in many graduate courses. Fortunately, it's not too late. You can create your DBclip after the conference (in fact, there are advantages to doing it now).

Web 2.0 Panel

This post is an experiment. We're holding a panel at VLDB 2007 on Web 2.0 and Databases. What are the opportunities for database research?

Our panelists are Sihem Amer-Yahia, Gerhard Weikum, a Donald Kossmann lookalike, Volker Markl, Anhai Doan and myself.

Given that the topic is Web 2.0, we thought the audience should post comments and opinions throughout the panel. Feel free to say anything. We'll monitor the blog during the panel and highlight your comments.

Comment away!

VLDB 2007 has announced a new program called DBclips. You can now upload a 5-minute video of your VLDB paper. (This idea may sound familiar to the readers of this blog).

You can listen to my first DBClip, created for the paper I wrote with Luna Dong and Cong Yu, on Data Integration with Uncertainty (also an idea that's been mentioned in this blog).

I can't wait to hear others' contributions!

The Lives of Others

The Lives of Others recently came out on DVD (i.e., now available to parents of young children). We just saw it -- highly recommended!

I will not even try to write a review of this movie. Just google for reviews.

Anant on the Database Blog

Another fine post by Anant.

Sergey's Story, in Chinese

A few months ago, Moment Magazine published an article written by Mark Malseed telling the story of Sergey Brin (Google co-founder).

My wife Oriana, in addition to being a lawyer, is quite a skilled English-Chinese translator (that's an understatement, believe me). She translated the article (under the obvious assumption that anything that is interesting to Jews would be interesting to the Chinese). And in fact, her translation was recently published in the Sunday Weekly Magazine editions of the SingTao Daily, the most widely circulated Chinese language newspaper in North America.

You can get the full translation here.

As with any Chinese-related content on this blog, the usual disclaimer applies: it's all Chinese to me.

Two Zodiacs Away

I feel like I should have something to blog about on my birthday. Next year, it won't be a problem -- the Chinese have scheduled the opening ceremony of the Olympics for that day.

Generally, I'm a guy who feels pretty comfortable with his age. That's an important survival skill when you work for Google. Every now and then, however, I realize that even though I'm not feeling any older, time is passing.

Today the poignant moment was when I was standing and talking with my group members and interns (all very nice, and they organized a nice birthday celebration. Special thanks to Bijun who ordered a chocolate babka from Zabars in NYC!). I was chatting with my youngest intern (a graduate student) who mentioned she was also born in the year of the rabbit, like me.

Then there was a pause. It took us both a mere 5 seconds to realize that she's *2* zodiacs away from me. Not one. There is an entire unrepresented rabbit in between us.

While it should be said that she is the youngest graduate student I ever worked with, I still found it startling that I'm working with a rabbit two zodiacs away. I'm sure I'll get over it.

Notes from Ed Tufte

Sorry for being silent in the last while. Blame it on the book I'm writing (on data integration). I had to finish a chapter before I could write anything else.

After several years of getting the brochures in the mail, and watching Tufte's book on envisioning information sit happily on my shelf, I decided to go for his 1-day course in SF and learn a thing or two about effective visualization.

If you want the full experience, I recommend sitting next to someone who worked on Microsoft PowerPoint at some point in their career. I did that, and it added quite a bit to the entertainment factor of the course. Tufte's PowerPoint rant starts about 5 minutes into the course and he makes his last jab in his closing remarks. But more on that in a bit.

The course was interesting, even if it mostly gets you thinking about issues relating to effective visualization. I jotted down a few notes that I'm repeating here mostly so I don't forget next time I'm preparing a presentation. Most of them are obvious, but that doesn't mean they're not often forgotten.

  • more detail in your presentation increases your credibility
  • more detail does not necessarily imply clutter (if done right)

  • annotate anything you can in a visualization. For example, annotate links (otherwise you're implying that they all mean the same)
  • don't try to be too fancy. Focus on the content not on the design. For example, tables are a very effective, yet simple presentation. Order the rows in the table according to some performance measure you're trying to emphasize.

  • don't focus on being original in your visualizations, focus on getting it right (don't innovate, steal).
  • get users out of the decoding business (i.e., remove legends where possible)

The deeper point made in the course is that the principles underlying the creation of effective visualizations mirror the principles that underlie thinking processes. Hence, for example:

  • Make and show comparisons between different aspects of the data
  • Make sure causality of effects is emphasized in the presentation
  • Build credibility -- make sure you show all the data rather than just cherry-picking what's convenient for you.
  • Enable the audience to drill down and see more data.
  • Integrate evidence from multiple sources (aha, a plug for data integration!)
  • Always give the source of your data (yes, lineage, folks!)

Following the principle that a good presentation should support critical thinkers in the audience, Tufte also points out what audience members should keep in mind as they listen to a presentation (surprisingly, reading your email on the BlackBerry is not one of them):

  • What is their story?
  • Can you believe them? Do they have any conflicts of interest affecting their perspective? What's their track record? What's their reason for bias?
  • What precisely does their argument apply to? What are its limits?
  • What do I really need to know when I leave the room?

Ok, back to the PowerPoint issue. I actually found myself a bit confused about the main point of his PowerPoint rant, but I think I get it now. His basic point is that PowerPoint forces you into a very low resolution presentation mode. He argues that people can read 3 times faster than you can talk. In addition, PowerPoint encourages you to leave quite a bit of detail out and summarize everything in bullets. The human brain can take in much more than what you can convey with a PowerPoint presentation. Hence, you're not really using your time with your audience very effectively, since there are better methods of conveying information that make much better use of the audience's mental capabilities. For example, he argues that you should come into a meeting with a 3-4 page text summary of your points, have your audience read it, and then have a discussion and answer questions.

The latter suggestion makes it pretty clear when his methods are effective and when not. For example, it's a non-starter for large audiences (e.g., conference presentations). On the other hand, there are cases where we do this by default (e.g., hiring meetings don't typically involve slide presentations). So, don't dump PowerPoint just yet.

Interestingly, Tufte was not following his own advice very carefully during the day. I felt that the principles he espoused could have been communicated more efficiently (but then, perhaps he assumed that some of the audience also had BlackBerries to attend to).

I just returned from SIGMOD 2007 in Beijing. This was only the second time SIGMOD has gone out of North America, and it was quite a success. The organization, led by Profs. Zhou Lizhu and Tok Wang Ling, and the technical program, chaired by Beng-Chin Ooi from the National University of Singapore, were excellent. It’s hard to beat the Summer Palace as a location for a conference banquet.

I wish I could give a technical summary of all the innovations reported at the conference. Instead, I’ll comment about a conclusion I drew from the three keynotes and give a quick plug for my own paper (this is my blog, after all).

SIGMOD featured 3 excellent keynotes from esteemed members of our community. Phil Bernstein from Microsoft Research talked about the progress made on Model Management, a research agenda he started 7 years ago, and on the challenges ahead. H.V. Jagadish from the University of Michigan talked about how to make databases usable, identifying some of the pain points and research challenges. Finally, Gerhard Weikum from the Max Planck Institute in Saarbruecken, Germany, talked about research combining databases and information retrieval. Each of them wrote a paper for the proceedings that is well worth reading.

I found a couple of common themes among the keynotes. First, they are all pushing the community in important and non-traditional directions. I found that extremely heartening. Second, I think the three keynotes, each from a different angle, support the claim that we need a much better understanding of users and their pain points when they work with structured data. That's a very touchy subject for database folks, who are used to spending their time 'under the hood'.

In Jagadish’s case, usability was the subject of the talk, and hence understanding users is crucial. In Phil’s case, he gave the example of generating schema mappings (mappings between disparate databases), and he was trying to get at what the pain points may be there (he argued that in the contexts he’s been considering, producing schema matches, typically the first step in mapping generation, is no longer the bottleneck). In Gerhard’s case, the question that comes up is what is the right answer when we combine DB&IR in a single system, i.e., what is the real user need. In a DB system, the semantics of the query clearly dictate the answer, but when you combine structured and unstructured data, it's no longer clear what the ranking criteria should be.

The point I’m trying to make is the following. As a community, we need to study user needs as they work with structured data, whether they are creating data, trying to understand existing data, formulating queries or creating mappings. Importantly, we need to keep in mind that a user’s task is rarely just to get an answer from a structured database. Users are typically working with both structured and unstructured data, and their tasks are broader than a single query. A useful interaction with a system is one that brings them closer to completing their task (I know, this is fuzzy, but that's why it's research).

It is tempting to push these problems to the HCI community, but I would argue this is a mistake. These problems will not be high enough on the agenda of the HCI community (there, if your device doesn’t move or perform magic, it’s uninteresting), whereas for us they are crucial for identifying good research directions and evaluating them. As a community, we need to find a way to encourage research on usability and to learn from the HCI community how to evaluate such research. We need to bring this agenda squarely into our conferences.

I'm not the only one to touch on this topic, and we're not the only community to see this need. A recent report titled “The Landscape of Parallel Computing Research: A View from Berkeley” argues a similar point about developing novel programming models. I think visualization is an important component of this research agenda (see Anant Jhingran’s blog post about this very point), and see Laura Haas’ excellent recent ICDE keynote and paper for a very nice articulation of this argument in the context of data integration.

My student Luna Dong presented our paper on indexing dataspaces. This paper has the distinction of being the first technical paper I published with dataspaces in its title. The paper describes a set of indexing methods that enable efficient querying of a collection of loosely coupled data sources (i.e., we do not have semantic mappings between them). Because of the nature of dataspaces, the queries we support enable the user to specify structure when she knows it, and keywords to complement the structure (we call them predicate queries and association queries). The basic idea underlying the solution is to extend the technique of inverted lists from IR to incorporate information about the structure of the data. Importantly, the technique also incorporates hierarchies in the data, and therefore it enables uniform querying of data sets that have different underlying structures. Luna performed a series of experiments showing the benefits of our approach, and comparing them to techniques for indexing XML that are the closest contenders to address these problems.
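For readers who want a feel for the basic idea, here is a toy sketch of an inverted list extended with structure. It is a simplified illustration, not the data structures or algorithms from the paper: the `word//attribute` key format, the attribute hierarchy and the sample items are all assumptions made up for this example, and only predicate-style queries are shown.

```python
from collections import defaultdict

# Hypothetical attribute hierarchy: both "author" and "editor" are kinds of "person".
PARENT = {"author": "person", "editor": "person"}


def ancestors(attr):
    """Return the attribute followed by all of its ancestors in the hierarchy."""
    chain = [attr]
    while attr in PARENT:
        attr = PARENT[attr]
        chain.append(attr)
    return chain


def build_index(items):
    """Inverted lists keyed by plain keyword and by keyword plus attribute."""
    index = defaultdict(set)
    for item_id, attrs in items.items():
        for attr, text in attrs.items():
            for word in text.lower().split():
                index[word].add(item_id)              # plain keyword entry
                for a in ancestors(attr):             # structure-aware entries
                    index[f"{word}//{a}"].add(item_id)
    return index


def keyword_query(index, word):
    return index.get(word.lower(), set())


def predicate_query(index, attr, word):
    return index.get(f"{word.lower()}//{attr}", set())


# Made-up items from two sources with different local structure.
items = {
    "doc1": {"author": "Ada Lovelace"},
    "doc2": {"editor": "Ada Byron", "title": "Notes on Ada"},
}
index = build_index(items)
print(keyword_query(index, "ada"))
print(predicate_query(index, "author", "ada"))
print(predicate_query(index, "person", "ada"))
```

A user who knows the structure can ask for `author`, one who knows only part of it can ask for `person`, and one who knows nothing can fall back to a plain keyword, all against the same index.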

Saturday, June 23, 2007

Tibet -- A Feast for the Eyes

Oriana and I just returned from an 8-day trip to Tibet. You can see a selection of pictures here, and the story below.

In the days before our departure, all I seemed to hear were stories about people who got very sick in Tibet because of the high altitude. In fact, in all the stories, the person who got really sick and had to be flown out was always described as “young, in his forties, and in otherwise very good shape”. I decided to ignore the stories and think positively.

Getting off the plane in Lhasa airport we felt light-headed, like waking up with a hangover. We had to walk very slowly to the baggage claim (but were still able to see the big sign with my full name held up by our guide-to-be). We spent the first day in the hotel, trying not to move and letting the body adjust. (It’s recommended to come to Tibet clean because they advise you not to take a shower the first night as part of the acclimation program.)

By the second day, we were able to walk and even climb the Potala, and by the third day even the headache went away. It took our tube of toothpaste about the same time to get used to the high altitude.

We spent the second day in Lhasa, going to the Jokhang (the most revered religious structure in Tibet), and the Barkhor, a lively market surrounding it. Already there, we experienced first hand the devout nature of the Tibetan people. While there were some tourists, we were overrun by locals making their way into the temple to make their offerings. In other parts of the world I’ve seen little old ladies pushing their way to a bus; here they were pushing their way to the Buddhas to offer anything from barley flour and yak butter to Coca-Cola.

We then went to the Potala, Lhasa’s best known structure, the seat of the Dalai Lamas. It indeed was as impressive as I imagined it to be (yes, from the movies). The night views of the Potala were especially impressive.

The most striking thing about Tibet (and the reason I’ve been talking about going for the last so many years) is the prevalence of color everywhere. It starts from the prayer flags erected on most houses, on bridges and peaks of mountain passes. The colors of the Tibetan clothing are wonderful. The window and door treatments, even in the poorest places are simply mind boggling. Even after a few days in Tibet I was simply amazed and happily taking it in as we drove throughout Tibet.

And drove we did (in the backseat of a Toyota Land Cruiser). Distances in Tibet are significant, and it’s not unusual to get stuck behind an army convoy or have to wait on the roadside for a high ranking official convoy to pass by. The drivers were actually reasonably careful (apparently, traffic fines are high enough). There are mileposts everywhere, so even though you’re often in the middle of nowhere, you can be quantitative about it.

The third day we drove to Shigatse, 260km west of Lhasa, home to the Tashilhunpo monastery, the seat of the Panchen Lamas for many years. (Supposedly, the relationship between the Dalai and Panchen lamas is like the sun and the moon, but there is more to the story than that). The Tashilhunpo was the most impressive monastery we saw and was full of (religious) life. On the way to Shigatse we visited Yamdrok Lake at 5000m (just when we thought we adjusted to the altitude…)

After the Tashilhunpo we realized that we could not possibly be impressed by another monastery. We had also spoken to a few other tourists who were on their way to the Himalayas. However, changing plans in mid-trip was nearly impossible. Nevertheless, Oriana started a long and drawn out negotiation with our guide, driver and tourist agency that seemed about as complicated as a typical M&A deal she negotiates in her day job.

As the negotiations proceeded, we drove to Tsetang (east of Lhasa), the cradle of Tibetan civilization (and home to the nicest hotel & breakfast we had). After a 40km drive on a very bumpy and windy road we arrived at Tibet’s first monastery, the Samye monastery. We were proven wrong; we were blown away by Samye as well.

The next morning, with the help of a promise of a tip, the negotiations came to a close and we started driving westward towards the Himalayas. We started at 7am, and drove for 12 hours through multiple mountain passes, very rural areas of Tibet and a couple of other hurdles that I was advised not to blog about. At 7pm, we were standing at 5000m elevation, looking at Mount Everest and its sibling peaks (Makalu, Lhotse, and Cho Oyu). The scene was definitely worth the drive, though perhaps one can argue that the sight of Rainier from Seattle is about as impressive. Naturally, the only other tourists with us at that vista point were a bunch of young Israelis. (Besides multiple groups of Israelis, we saw some French, some Americans and at some points, what seemed like the entire nation of Japan). At 7pm we started a 300km drive back to Shigatse, the closest place with a reasonable hotel.

Driving through Tibet we noticed that it probably has the highest per-capita number of pool tables. Pool tables were adorning the sides of the roads in the most rural villages. In some cases, they were actually used for playing pool, and in others, as stands for decorative items. Our guide didn’t offer a convincing explanation of the phenomenon. Cell phone reception, even in remote parts of Tibet, was typically better than in my home or office. That enabled me to read (and mostly ignore!) my email.

From the culinary perspective, once you get over your craving for yak meat, yak milk, yak butter, your main choices are Nepalese, Indian and Chinese food. I did manage to find a decent cappuccino in Lhasa!

I will not make any political comments on Tibet here, but will mention one anecdote. Apparently, in the past few months, two groups of American students went to Everest base camp and staged pro Tibetan independence demonstrations. As a consequence, besides making it harder for others to reach there, their innocent Tibetan guides and drivers were put into prison for 5 years. So if you’re going to make political statements, make sure you understand the local dynamics before you put your friends at risk.

In summary, Tibet is a wonderful place to visit. Although it is modernizing very rapidly, you can still see the old, especially if you head out of Lhasa. Be prepared for long drives and a few dirty toilets here and there. But make sure you bring a good camera (thanks Pandu for the great recommendation!) and take it all in! I’m sure that of all people, my mother-in-law Helen is the happiest we made the trip because now she doesn’t have to hear me talking about going there anymore.

SigTube: 5 Minute Presentations of Conference Papers

The success of YouTube and the like has proven a very simple point: a 5-minute video is a very effective mode of communication. People love to see stuff in 5-minute nuggets (or less). I'm suggesting we learn from this observation for better dissemination of scientific results.

I'm proposing that along with every paper published in conference proceedings, we also create a 5-minute video presenting the highlights of the paper, and make the presentations available on the web for free. I'm calling this SigTube (mostly to encourage people to come up with a better name).

A 5-minute presentation (done well) can give quite a bit of information and insight about a publication, certainly more than the 100-word abstract or the paper's introduction. I know I would love to sit through a bunch of these from time to time and learn more about what's going on in my field, even in areas that are farther away from my main interest areas (in fact, probably mostly in such areas). A video also captures the enthusiasm and emphases of the speaker (not to speak of the fact that it preserves their youth for eternity!)

With today's infrastructure and technology, this is pretty much trivial to do. A conference can dedicate a person with a video camera who will film the videos during the conference. Alternatively, some authors may prefer to film the videos on their own and send them in.

Any takers?

Me and Web 2.0

My Web 2.0 credentials are really shooting through the roof as of late.

As of yesterday, I uploaded my first video to YouTube. The video shows my daughter (6 y/o) dancing nicely, with my son (18 months) "accompanying" her. As you can see, my son has already attained my level of dancing ability (my daughter passed me a long time ago).

I decided to try out Google MyMaps for fun. I created a map of my life and travels (and had fun doing so). Take a look -- (with the subtle implied hint that I'm happy to be invited to places not yet marked on the map).

Finally, together with Sihem Amer-Yahia from Yahoo!, I'm organizing a panel at VLDB 2007 (Vienna, September) on "Web 2.0 and data management". We have an exciting lineup of panelists that includes Anant Jhingran from IBM (who also blogs furiously), Gerhard Weikum (Max Planck Institute in Germany), Donald Kossmann (ETH Zurich) and AnHai Doan (U. of Wisconsin, Madison). I'm sure I'll be saying more about this panel on this blog, so stay tuned.

Circle of Blue - Eloquent Version

Keith Schneider wrote a much more eloquent post about the Circle of Blue Powwow I went to a couple of weeks ago.

This is a (perhaps very rare) opportunity to directly compare the writing skills of a guy who regularly writes for the New York Times with those of a guy whose readership includes mostly database professionals.

Circle of Blue

I spent Friday with an amazing collection of people mostly from the journalism and photography world. Among others, this collection included a previous photography director for Newsweek and Sports Illustrated, a creator of events such as the opening and closing ceremonies of the Salt Lake City Winter Olympics and the 50th anniversary of Disneyland, authors and writers for various newspapers, picture editor for the Washington Post, the person responsible for managing the US government water policy, a previous director of multi-media for MSNBC, founder and director of the Pacific Institute, and a NASA astronaut. Some of the people attending had spent significant amounts of time in third-world countries working on sanitation projects.

We were all hosted on the Pine Hollow estate, which is an amazing 30+ room mansion on the shores of Lake Michigan, a bit north of Traverse City. The home was built by Leslie Lee, and includes every amenity imaginable to man combined with excellent taste in design.

So what were we all doing there? This was basically a Circle of Blue powwow. Circle of Blue is a non-profit organization that is dedicated to raising the awareness of the public and policy makers to the diminishing supplies of clean and affordable fresh water. CoB tries to raise awareness through a combination of journalism, photography, film and data collection. Carl Ganter, the founder, is quite an amazing guy and among his other major accomplishments (e.g., being a photographer for National Geographic) tells a great story of how, through a great work of photography and journalism, he (and others) were able to exonerate a wrongfully convicted father and reveal the real murderer in a case in Illinois. He and his wife Eileen conceived and planned Circle of Blue.

There is no way I can do justice to the entire discussion in a short blog post, nor can I fully convey the tenacity and passion of the people gathered. I will also skip the many details on the major water issues facing our planet (but I will point out that water is one of the few main foci of the Google Foundation). Instead, I'll just highlight a few points I found interesting from my perspective.

Our discussion focused on how exactly to leverage tools and technology to raise awareness on water issues. The ideas discussed were all over the map. They ranged from creating blue rings that everyone would put on their faucets (following Lance Armstrong's yellow rings for cancer fighting), to using Web 2.0 tools such as blogs, Google My Maps, Flickr, etc. to help people all around the world to create databases of water-related issues, and to mobilizing the religious right to take up the issue in their congregations.

In a sense, we were trying to figure out how to recreate the success of the green movement, but in blue. While there is much in common between global warming issues and water issues, there are also a few key differences between the two. First, in the case of green, there are some simple things everyone can do to help a global problem (e.g., buy a hybrid, go solar). In the case of water, aside from taking shorter showers and watering your garden more effectively, many of the major issues are of local nature and the problems and solutions vary quite a bit. Second, the people suffering from water shortages at this point are typically far away and that makes it hard for the issue to be on people's minds constantly. New Orleans is much closer to home.

The other interesting point about the discussions was how to combine traditional media like journalism, film and photography with newer technology to create viral awareness of the water issues. While it's great to have the high-quality polished artifacts created by these media, we also need the bottom-up YouTube-type videos and blogs created by a much broader and geographically distributed set of people, but with much less skill (myself included...) to really reach people's attention. We need to collect good data, but mostly, make sure the data is used in effective ways for highlighting the issues and garnering world-wide attention.

This was a highly inspiring meeting for me. If you have any ideas, don't hesitate to post a comment, send me email, or contact Carl Ganter directly. I'm sure this topic will reappear on this blog.

The Slash Effect

I just finished reading "One Person/Multiple Careers", by Marci Alboher, Author/Speaker/Coach. Like with other books described in this blog, my reading agenda is highly influenced by authors visiting Google (which happens often enough to keep any reader quite busy).

The main contribution of this book is to get you thinking. Slashers are people who have multiple parallel careers. Through numerous examples, the book claims that this is a growing phenomenon in today's culture, and describes the challenges, opportunities and benefits having to do with slash careers. The point that I found most interesting about all of the above is that slashing essentially gives you multiple identities in society. Think of what you answer at a party when people ask you what you are or do. Being a slasher means you can give multiple answers, or choose one you think best suits the situation. But more than that, slashing means you gain some internal balance in life, rather than being tied to one professional identity.

Marci gives examples of lawyers turned writers & coaches (including herself), a teacher with a modeling career, a computer programmer who also directs a theater, a lawyer who's also a Baptist minister, Sanjay Gupta, the CNN health correspondent who also does surgery a few times a month, and the list goes on and on. She discusses how people manage multiple careers, some of the cross-over benefits and life-style benefits they obtain, and she offers practical advice on how to become a slasher. The book essentially revolves around all these examples, and every chapter ends with the highlights of its main points (great for future reference).

Being a somewhat formal guy on occasion (perhaps one of my slashes?) I found myself looking for a definition of a slash. Marci seems to focus on aspects of life that are part of your career (it doesn't actually matter whether you derive much income from it, otherwise most of the poets and actors would not have made it into the book). But, for example, does a hobby count as a slash? Does it depend on how much time one spends on the hobby? In fact, many jobs are composed of multiple slashes (e.g., professors spend half of their time teaching, half their time doing research, and the other halves raising research funding and sitting on committees).

Clearly, parenting is the most common form of parallel activity adults engage in. The book contains a chapter on parenting and how parenting and slashing share many challenges. The book even claims that a slash life can prepare you better for parenting (though clearly, some of the slashes may take a back seat for a while).

However, my search for a formal definition of slashing is missing the point. As I stated at the outset, the point of this book is to make you think about all the aspects of your life whether they count as slashes or not. Personally, the most common slash combination I've encountered (and personally experienced) is the professor/entrepreneur combo, and I can speak at length about the benefits and challenges there.

Finally, one point that was not addressed in the book is multiple careers that happen in sequence, rather than in parallel. Perhaps I'll take the opportunity to coin a new term: the double backslash. (For those of you who haven't had the pleasure of using the LaTeX word processor, I should explain that a double backslash creates a new line in the text.) I would think that slashing and double backslashing share many of the challenges and benefits.

In summary, this book is a rather quick read (you can skip parts, but pay attention to the boldfaced sentences). I found myself reflecting on my slash/double backslash riddled career and wondering what other slashes may come into my life at some point.

Chinese poetry recital

This post is a few weeks late, but the parental pride is not diminished at all. My daughter Karina participated in the Northern California championship for Chinese poetry recital. This competition is organized by the association of Chinese schools of Northern California that includes over 70 schools and includes many categories (remember: this is all Chinese, so I'm a bit sketchy on some details). To get to the regionals, she had to win her school competition. She was given several poems and in the competition she had to recite one of them. The competitors were judged mostly on pronunciation (and on not using their hands).

There were 21 participants in her category (she competed in the 5-7 age group, being on the very young side of that). Parents were not allowed into the room, or even to see the judges, before the competition.

As the picture shows, Karina took first place! The victory was immediately celebrated by the biggest ice-cream she ever had, but suffice it to say that her maternal line (i.e., mom and grandmother) did not sleep that night out of sheer excitement! Clearly, this is one of Karina's achievements that I made absolutely no contribution to.

Pandu's blog

My good friend and grad school buddy Pandu Nayak recently started a blog. He has a higher ppd (posts per day) than I do and the posts are on a variety of topics (you'll never get bored).

Pandu and I often IM each other (mostly for coordinating critical issues such as espresso consumption). I think the next step is for us to get MySpace accounts, and then we'll be Web 2.0 compliant. Who knows, maybe with such openness, our children will consider talking to us when they become teenagers!

From Darfur to Robotics

FIRST is an international competition that engages teenagers in technology. Their best known competition is in robotics, but they have a bunch more. My connection to FIRST is through coffee (I had the pleasure of meeting the founder of FIRST, Dean Kamen, next to the espresso machine at Google), and through family (my brother-in-law Asaf Menuhin has been running the FIRST competitions in Israel). The following inspiring story appeared in an Israeli newspaper and was forwarded to me by my parents.

"A" is a 16 year old student who escaped the killing in Darfur two years ago. He somehow got to Egypt, and from there crossed the Israeli border. Initially, he (and the others with him) were arrested. After a while, he was let out and joined the Kfar Yamin boarding school not too far from Haifa, Israel. "A" immediately became a star student at the school. One night, when he was walking around the school he noticed the lights on in the science lab. He went in and saw a group of kids preparing for the FIRST regional robotics competition, and immediately fell in love with robotics. Shortly after, he became the leader of the group. "A" led the group to an impressive 4th place standing in the Israeli regional. He was disappointed because that was not enough to earn him a trip to Atlanta for the finals, but everyone else is still in awe of the huge step this young man made in such a short time. I bet Dean did not anticipate this story when he started this amazing establishment!

Memorial Day

This is a somewhat unusual topic for this blog, or perhaps a sign that I'm finally getting with the spirit of Web 2.0.

Today is Israeli Memorial Day. We remember the soldiers and civilians that were killed in wars and terrorist attacks during the history of Israel. In Israel, this day is taken very seriously (e.g., the notion of a Memorial Day Sale does not exist). As a striking example, at 11AM on this day, there is a siren sounded throughout the country. Every person, and I mean, every single person, will stop what she or he is doing and will stand with respect for 2 minutes. If you're driving a car, you stop the car and stand outside. 2 minutes of complete stoppage. Immediately after the siren, the memorial services begin at all the cemeteries. In the evening, as the sun sets, the country turns from a day of mourning to a day of celebration of its independence. It's quite a striking transition.

For me this is always a very special day. My middle name, Yitzchak, is in memory of my uncle (my father's brother) who was killed in 1948 during the war of independence. He was 27 when he was killed, serving in the Israeli Air Force. I, of course, never got to meet him, but indirectly, he influenced my life quite a bit. He inspired my father to get into chemistry, which took him to academia. In growing up, I never questioned whether I'd end up as an academic or not.

Since moving to the Bay Area, I've been attending the Memorial Day ceremonies organized by the local Tzofim (the scouts). As part of the ceremony, they read out the names of fallen ones who have relatives in this area. They read the names in chronological order of their deaths. It struck me as I was sitting there tonight that I was sitting in anticipation to hear the end of the list, i.e., the new additions from last year. Unfortunately, due to the war in Lebanon in the summer of 2006, the list was indeed longer, and their stories heartbreaking as usual. Let us all hope these lists stop growing longer. There is too much unnecessary pain already in the region.

I just finished reading Wikinomics, by Don Tapscott and Anthony Williams. This is certainly an interesting and thought provoking read. It's interesting from various perspectives, including the structure of current and future businesses, the future of education, and how to think of one's own career (at whatever stage you may be). Certainly, as with many books of this sort, the same message could have been conveyed in fewer pages (this applies especially to the first 40 pages or so), but overall, it's very worthwhile.

The main thesis of the book is that mass collaboration changes some fundamental aspects of running a business. Three forces have recently come together to create the perfect storm that facilitates Wikinomics: (1) technology (basically, Web 2.0 where anyone can contribute to anything), (2) the Net-Gen -- the generation of people who grew up collaborating (think: kids who view email as a thing only their parents do), and (3) the global economy, where companies are forced to reach out and collaborate to produce additional value (i.e., The World is Flat).

Perhaps the most succinct description of the principle underlying Wikinomics is a rephrasing of Coase's Law (which was coined around 1937 by an English socialist). The law says that a firm will expand until the costs of organizing an extra transaction within the firm become equal to the costs of carrying it out on the open market. For example, if you're a car company and it's cheaper to manufacture your own tires than to use an external supplier, then you will do so in house.

The observation is that the internet has lowered transaction costs so significantly that the right way to think of Coase's law nowadays is that firms should shrink until the cost of performing a transaction internally no longer exceeds the cost of performing it externally.
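To make the make-or-buy arithmetic concrete, here is a tiny sketch of the comparison. The function name and all the numbers are hypothetical and mine, not the book's; the only point is that the same tire decision can flip once the overhead of dealing with an outside supplier drops.

```python
def should_do_in_house(internal_cost, market_price, transaction_cost):
    """Keep an activity in house only if that is cheaper than buying it
    on the market once the cost of the external transaction is included."""
    return internal_cost < market_price + transaction_cost


# Made-up per-tire numbers for the car company example.
INTERNAL_COST = 100.0   # cost of making a tire yourself
MARKET_PRICE = 80.0     # price a supplier charges

# High overhead for finding, contracting and coordinating with a supplier.
before_internet = should_do_in_house(INTERNAL_COST, MARKET_PRICE, transaction_cost=30.0)

# The same decision once that overhead has collapsed.
after_internet = should_do_in_house(INTERNAL_COST, MARKET_PRICE, transaction_cost=5.0)
```

With these numbers the firm makes its own tires in the first case (100 is less than 80 + 30) and buys them in the second (100 is more than 80 + 5), which is the "firms should shrink" reading of the law.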

The book then goes on to illustrate examples of the different aspects of Wikinomics:

Peer production: examples of great achievements created by large collections of collaborating peers: Wikipedia, Linux (and more importantly, IBM's embrace of the open-source community as an example of a firm doing the right thing).

Ideagoras: essentially, using the open market for research into your specific problems. The observation being that no matter how many researchers you employ, the person with the best ideas for a particular problem is likely not in your lab (the authors still emphasize the importance of internal R&D though). Here, the main examples are InnoCentive, a company that acts as an eBay for ideas, and Procter & Gamble, which was in quite a bit of trouble, but managed to tap external ideas to make a comeback.

Prosumers: companies that benefit from their consumers essentially developing their products. The big example here is Second Life, where the consumers create more than 99% of the content being transacted (and the consumers get the IP rights to anything they create!). Another example is Lego, which lets users create and share their Mindstorms creations. An interesting example is that of consumers tinkering with iPods (and Apple not standing in the way, as opposed to Sony not taking that approach).

The New Alexandrians: the creation of new data banks that enable research to proceed faster (the Human Genome Project). One of the interesting discussions there was about reshaping the relationships between universities and companies. The recently created Intel Lablets are an excellent example of that (and I predict this is one we'll see more of).

New platforms: Google Maps, need I say more? Ok, I will. Actually, Amazon was way ahead on creating platforms for others to build on, and today Yahoo, eBay, Google and Amazon are creating exciting platforms for others to create additional services on.

The Global Plant Floor -- companies changing the way they interact with suppliers. Instead of Boeing sending exact specs for each part of a new plane, they let the suppliers design and innovate as much as possible. They also let them assemble much more of the components, and therefore, it now takes Boeing 3 days to assemble an airplane once all the parts have arrived, rather than 30. BMW is another example described here, along with Lifan, a Chinese company that is making waves in the motorcycle industry.

Finally, there is a discussion of how wikis really changed the possible interactions in corporations.

A few thoughts.

First, there is no doubt that there is a lot of evidence of the power and presence of Wikinomics presented in the book. While the examples were very good, they were still few. This leaves one with the feeling that Wikinomics may remain a fringe rather than mainstream in business.

Second, one may wonder whether we didn't hear all this in The World is Flat by Tom Friedman. Certainly, there are many interesting relationships between the two books. I think Friedman addressed a narrower aspect of the picture: basically that companies can distribute themselves across the world and work more aggressively with partner companies. Wikinomics goes one step further and discusses how companies should leverage the masses, not just partners, and how that affects the way we think of IP rights and communications within corporations.

Third, as a computer scientist, I wonder what CS has done about all of this. While we've been responsible for many of the technologies that created the tools mentioned in the book, I'm not sure we leveraged the tools ourselves to benefit our own research. Certainly, in education we haven't. In fact, Don Tapscott argues that there are very few industries that have changed very little in the past century, but education is one of them. It's still mostly built around teachers standing in front of students and lecturing. Food for thought.

Finally, there is a lot of discussion in the book on IP issues and generally, on how business relationships between companies should be structured. I think every lawyer should read this book, and hence, am moving the book to my wife's side of the bed.

The J Curve

I've decided to start writing short summaries of (some) books I read. Please do not expect literary pieces here -- this is not my attempt to sneak into the book review section of the New York Times, nor is it an attempt to make up for a few missed book reports in grade school.

Had I started this two years ago, I would have definitely blogged about "The World is Flat" by Tom Friedman. Too late now. But I think this book should be required reading for anyone interested in data management (either in research or industry). Just read that book and imagine all the data management services we need to invent to support the flattening of the world. Of course, the book has other merits too.

The J Curve by Ian Bremmer, is essentially a framework for discussing the stability and openness of countries and how you can understand political events in the context of moving on the curve. You need to imagine the letter J rotated about 45 degrees clockwise to get the full effect. The X axis of the J curve is the degree of openness of a country (e.g., travel restrictions, freedom of the media, economic openness, the presence of independent political institutions). The Y axis is the degree of stability of the country (i.e., whether certain events would cause great chaos or not). If you're on the top left of the curve, you're closed but stable (e.g., North Korea). If you're on the top right, you're open and stable (e.g., USA, western Europe). If you're China, you pose an interesting challenge to the curve (more on that in a moment).

One of the main points the book makes is that for countries to go from the left to the right they will have to first go down the curve and therefore suffer some considerable instability. The world can help these countries by raising the entire curve, i.e., making the depths of the curve stable enough so countries will survive the transition. (In practice, that observation lets Bremmer criticize many of the policies the USA has taken w.r.t. some countries).

The book goes through a few examples of countries in each part of the curve. It starts with North Korea, Cuba and Saddam's Iraq as examples of stable but closed. It discusses Iran, Saudi Arabia and Russia as countries that have the potential of sliding down the curve from the left. He shows South Africa as an example of a country that made it through the transition successfully and Yugoslavia as one that didn't. He takes Turkey, Israel and India as examples of countries founded on the right hand side of the curve and who have maintained it that way (though they do face challenges going forward). Finally, there is a chapter on China, where Bremmer argues that despite its economic openness, China is still on the left side of the curve.

What I liked most about this book are the brief yet insightful summaries of the relevant history of each of the countries discussed. The summaries give you the background for why things are the way they are now and let you understand better the challenges facing the countries. I'm finding that it's easier for me now to put current news into context, and in fact, that the curve does give a pretty good framework for thinking about today's world events.

That being said, the chapter on the country I am most familiar with, Israel, was a bit disappointing, so perhaps people from other countries would say the same about their respective chapters. I enjoyed reading about all the complexities in Yugoslavia, though I wish he had spent a page or two describing some of the main events of the war there (he stopped just before it saying it would be too complicated).

In discussing the book with Donald Kossmann, he wondered whether corporations can also be classified according to their openness. For example, a company that has a very closed set of programmatic interfaces to its products (and I won't name the names he mentioned) may be considered a North Korea of companies. An interesting thought to develop.

In summary, I would definitely recommend this book. If you're an expert on world affairs and attended all your history classes in school, you may find less of a payoff from reading it.

I'm moving on now to Wikinomics.

Uncertainty and data integration

A short rant about the relationship between databases that manage uncertainty and lineage and data integration systems.

Recently, there has been renewed interest in building database systems that handle uncertain data and its lineage in a principled way. The Trio Project at Stanford and the MystiQ Project at the University of Washington are just two examples (I collaborated for a while on the former, and watched the latter up close while I was a professor at UW).

I think this is a great research area and certainly (no pun intended) a very timely one. I want to make two points though (one of which may raise Jennifer Widom's blood pressure).

First, I think data integration is the killer application of uncertainty and lineage (ok, maybe there is a second -- sensor networks). Fundamentally, data is uncertain when it comes from external sources and some of the transformations it went through on the way are not necessarily correct.

In fact, I think one of the greatest challenges for data integration research is to build data integration systems that deal gracefully with uncertainty (uncertainty can be about the underlying data, the schema mappings and the mapping of keyword queries to structured queries). If you have good ideas about this, please do contact me.

My second point is that there is really no argument here. In fact, I believe that once a database system is able to model and process uncertain data and its lineage, much of the distinction between traditional database systems and data integration systems goes away.

Specifically, by modeling the lineage of data and the fact that it may be uncertain, you're admitting that the data came from somewhere, and that you're not sure about the transformations that brought it into the database or about its intrinsic meaning. That's exactly what data integration is about -- modeling data that comes from multiple sources. Unlike ordinary databases, where the data might as well have been born in the database because you know nothing about its past, databases with uncertainty and lineage admit that data had a prior life.
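
To make the idea concrete, here is a toy sketch in Python of a tuple that remembers where it came from and how much we trust it. This is only an illustration, not how Trio or MystiQ represent things: the names UncertainTuple and combine are made up for this post, and multiplying confidences assumes the two inputs are independent, which is often not true in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UncertainTuple:
    """A tuple that admits it had a prior life (hypothetical toy model)."""
    values: tuple        # the data itself
    confidence: float    # how much we trust it, between 0 and 1
    lineage: tuple = ()  # names of the sources or tuples it was derived from


def combine(a: UncertainTuple, b: UncertainTuple) -> UncertainTuple:
    """Derive a new tuple from two inputs, as a join would.

    Assumption: the inputs are independent, so confidences multiply.
    The lineage of the result is the lineage of both inputs.
    """
    return UncertainTuple(
        values=a.values + b.values,
        confidence=a.confidence * b.confidence,
        lineage=a.lineage + b.lineage,
    )


# Two facts that arrived from different (made-up) external sources.
place = UncertainTuple(("Ein Gedi",), 0.9, ("source_A",))
country = UncertainTuple(("Israel",), 0.5, ("source_B",))

derived = combine(place, country)
print(derived.values, derived.confidence, derived.lineage)
```

The point of the sketch is that the derived tuple carries both its sources and a weaker confidence than either input, which is the same bookkeeping a data integration system needs for data pulled from multiple sources.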

So then what's left of the difference between databases and data integration systems? Mostly issues having to do with query processing over remote sources.

I should emphasize -- I'm not claiming that these problems are solved (quite the contrary, see my comment above about data integration with uncertainty). But I do find it quite appealing that a database system models the fact that data came from the outside. That's the way it typically is in the real world, and it's about time databases realized it too.

Structured data and the web

One of my main areas of focus at Google is the relationship between structured data and web search. There are now vast amounts of structured data out there: mostly in the deep web (i.e., in databases behind HTML forms), but also created by annotation schemes (e.g., Flickr, Google Co-op, etc.) and stored in Google Base. The question then is how to use this data to improve the results of web search.

I recently published two related papers on this topic, one at CIDR 2007 and one in the Data Engineering Bulletin (go to Page 19 of the issue). You can read the papers for the details, but I'd like to highlight two key points from these papers that should be kept in mind when researching this area:

  1. Integration: Whenever you cook up an idea about how to improve web search by leveraging structured data, or by automatically structuring data on the web, you need to keep in mind how your technique will integrate with other web searches. Users want to go to a single search box to find all their results. So whatever technique you come up with needs to mesh well with the other techniques used by the underlying engine.
  2. Data about everything: Many ideas work well if the domain of the data is constrained (e.g., you know you're building a portal to search for cars, housing or job listings). But on the web, data is about everything. There is no domain or set of domains that covers all data on the web. In fact, it's not even clear when one domain ends and another one begins. So try to imagine what it's like to deal with data about everything. That changes a lot in the way you think about a problem!

Ein Gedi

I recently went for a quick trip to visit my family in Israel (with a short stop in Amsterdam, where I saw an amazing exhibit).

While in Israel, I went for a hike in Ein Gedi, one of my all-time favorite places. The trip was organized by Tova Milo's database group at Tel-Aviv University. Check out the pictures from Ein Gedi (and a few others).

While Ein Gedi has always been a place for me to find complete peace, staring into the Dead Sea, this time my BlackBerry made it a slightly different experience.

I've finally decided to start blogging and am very excited about it!

The posts on this blog will either be about work (i.e., data management ideas) or my family. No politics (you probably don't want to hear it anyway), but possibly a passing comment on coffee or other exciting events.

By way of background, until recently I was a professor at the University of Washington. I moved to Google in September 2005 and lead a group that looks at how structured data can be used in Web search. As I publish papers about this work, I'll summarize them here. For all my publications prior to coming to Google you can check out my UW web site.

One of the goals of this blog is to get people in the data management community to share novel ideas and discuss them. While technical results are well served by published papers, Web 2.0 gives us the opportunity to discuss ideas outside our conferences quite easily.