Wednesday, February 23, 2011

Telling Time/Date in Ethiopia (also, a good introductory programming exercise)

As the regimes in Northern Africa were coming under intense pressure, I flew to Ethiopia to visit coffee farms and experience its coffee culture. I'll write about the coffee aspect of my visit later, but I wanted to share an interesting cultural anecdote first.

Before I went to sleep in a traditional Sidama hut, my host announced that we would be having breakfast at 1:30. I was puzzled, as I wasn't sure whether he was letting me sleep in really late or planning a very early rising. Upon inquiring, I discovered a fascinating aspect of Ethiopian culture -- the way they tell time and dates.

We'll start with time. The Ethiopian day starts at 6am. So when they say 1am, they mean our 7am (so my breakfast was actually going to be at a reasonable time). Essentially, they are 6 hours off.

Now for dates: the Ethiopian year starts on September 1st (at 6am, of course). They have 13 months. The first 12 months each have 30 days (none of this 30-or-31-day business). The last "month" has 5 or 6 days, depending on whether the rest of the world had a leap year or not. So now, in our February, it's June in Ethiopia (it certainly felt that way).

To make things a bit more complicated, it's now 2003 in Ethiopia. On September 1st, 2011, the year 2004 will begin.

You now have all the information you need to write a date converter into Ethiopian date/time. Seems like a fun programming exercise for an introductory course.
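For the impatient, here is one possible solution, sketched in Python. It follows only the simplified rules described above (the day starts at 6am, the year starts on September 1st, months have 30 days, and the year number runs 7 or 8 behind), so it should not be relied on for real calendar conversions. The function name `to_ethiopian` and the tuple it returns are my own choices for illustration.

```python
from datetime import date, datetime, timedelta


def to_ethiopian(moment: datetime) -> tuple:
    """Convert a datetime to (year, month, day, hour, minute) using the
    simplified rules from this post, not the full Ethiopian calendar."""
    # The Ethiopian day starts at 6am, so shift the clock back 6 hours.
    shifted = moment - timedelta(hours=6)

    # Find the most recent September 1st (the start of the Ethiopian year).
    new_year = date(shifted.year, 9, 1)
    if shifted.date() < new_year:
        new_year = date(shifted.year - 1, 9, 1)

    # Days elapsed since new year: 12 months of 30 days, and whatever is
    # left over (5 or 6 days) lands in the 13th "month".
    days_in = (shifted.date() - new_year).days
    month = days_in // 30 + 1
    day = days_in % 30 + 1

    # The year that starts on September 1st, 2011 is 2004.
    year = new_year.year - 7
    return (year, month, day, shifted.hour, shifted.minute)


# My 1:30 breakfast, which happened at 7:30 our time.
print(to_ethiopian(datetime(2011, 2, 23, 7, 30)))  # (2003, 6, 26, 1, 30)
```

Shifting the clock first and then counting days from the new year keeps the leap-year handling out of sight: the 13th month simply gets however many days remain before the next September 1st.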

Sunday, September 12, 2010

The value of ecosystem services for coffee production

For reasons that will become clearer in a few months, I've become somewhat interested in biodiversity preservation and ecosystem services. Examples of ecosystem services are pollination by bumblebees, decomposition of wastes, and flood mitigation and carbon sequestration by forests.

One of the main problems with ecosystem services is that it is hard to attach a monetary value to them. As a result, decisions to cut down forests, develop lands and interfere with water flows often undervalue these services. Even if there is a value attached to the service, the fact that it is provided by nature makes it harder to tell who should pay to preserve it.

I was recently shown a nice example of where the value of an ecosystem service has been quantified, and, no less, in the area of coffee production! In [1], Ricketts et al. show the value of having a forest close to a coffee plantation. They conduct this study on a coffee farm in Costa Rica, and show that within a distance of 1km from the forest, the benefits of forest-based pollinators (i.e., diversity of bees) can increase the production of the farm by 20%. This observation, as they show, can be directly translated to a monetary value. The basic reason that the proximity of the forest is important is that the diversity of bees in the forest enables better cross pollination among plants (whereas, for example, honey bees typically focus on single branches when flowers are dense). Interestingly, the diversity of bees also reduced the number of peaberries produced, which may be slightly more controversial if your goal is to make money off peaberries (which some do).

Thanks to Gretchen Daily for sharing her article with me.




[1] Taylor H. Ricketts, Gretchen C. Daily, Paul R. Ehrlich, and Charles D. Michener. Economic value of tropical forest to coffee production. Proceedings of the National Academy of Sciences, August 24, 2004, Volume 101(34). Pages 12579-12582.

Friday, June 18, 2010

Fun with World Cup Soccer Statistics

As a teenager I was curious about which minutes in a soccer game are the most likely to have goals scored. I wrote a computer program that stored a database of all the goals scored in the Israeli soccer league for an entire year. I diligently went through all the sports sections of the newspapers and entered all the goals and minutes in which they were scored (feeling very mature that I was able to ignore my strong feelings about some of these goals). I calculated the statistics I was looking for, and the answer was: minute 65 was the most goal-rich minute.
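I no longer have that program, but the computation amounts to a few lines. Here is a minimal sketch in Python; the function name `most_goal_rich_minute` and the sample list of minutes are made up for illustration and are not the league data I collected.

```python
from collections import Counter


def most_goal_rich_minute(goal_minutes):
    """Return (minute, goals) for the minute with the most goals.

    Ties are broken in favor of the earliest minute.
    """
    counts = Counter(goal_minutes)
    best = max(counts.values())
    minute = min(m for m, c in counts.items() if c == best)
    return minute, best


# Hypothetical sample: the minute in which each goal was scored.
sample_goals = [12, 65, 44, 65, 90, 12, 65]
print(most_goal_rich_minute(sample_goals))  # (65, 3)
```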

Now, a few decades later, as the 2010 World Cup begins, I find myself asking the same question, or rather, marveling at how easy it is to capture the data, compute the statistics and share them with everyone in the world.

Using Google Fusion Tables, the tool developed by my team at Google, I created the visualization below. We're updating the underlying table as more goals are scored, so you'll always see the latest stats.




But that's not the end of it. Fusion Tables is a tool for data integration. We found some data on fifa.com and joined it with our own table, and then created more interesting visualizations.



This one shows the height of the goal-scoring players. Read into it what you want.




This one shows the distribution of goal scoring among defenders, forwards and midfielders.



And finally, this visualization shows the clubs at which the goal scorers play.

Saturday, February 6, 2010

The Checklist Manifesto

I just finished reading "The Checklist Manifesto" by Atul Gawande, a very interesting book.

Gawande, a surgeon, essentially makes the following point. Given the incredible amount of knowledge we have accumulated in some professions, the complexity of certain tasks can be overwhelming to professionals (e.g., surgeons, airline pilots). Since in many situations these professionals work under pressure, they often forget some very simple yet important steps, which later creates unforeseen problems (e.g., making sure the antibiotics are given at a particular time before the incision is made into the patient).

Hence, he argues for simple checklists that teams should go through to ensure that important details are not glossed over. In the airline industry, checklists are used religiously. At every step of the flight, or whenever anything goes wrong, there is a checklist for the flight crew to follow. Gawande's main argument is that this principle should be applied in other professions as well, and in particular, in medicine. He describes his experiences launching such a checklist program with the World Health Organization and the impact that it had on reducing complications following surgery.

There are two main challenges to this strategy. First, the checklist needs to be short so as not to slow down work. Hence, choosing and phrasing the items on the checklist requires significant thought. The second challenge is putting ego aside. For example, surgeons are used to being the kings of the operating room, and do not lightly take comments from nurses or other staff. Well, pilots have gotten over it, and they're not slackers in the ego department.

Gawande also gives examples from the construction industry and from restaurants, where constructing a high-rise or making sure that everything comes together at the right time on a customer's plate can be rather challenging. One main observation he makes from all of these examples is the importance of communication among the team members, in addition to the checklist. It is crucial for members of the team to communicate well with each other, and building communication into the workflow is key. That way, it's less likely that things fall through the cracks and lead to additional problems.

To me the book was interesting because it points out that even if we build a huge body of knowledge in a particular domain, applying this knowledge in practice can be equally challenging.

A Trip to Australia

I recently returned from a trip to Australia, where I gave a keynote at the Australasian Computer Science Week, the annual gathering of computer scientists from Australia and New Zealand. You can see a journalist's account of what I talked about here.

There is a small but very strong database community in Australia, and I encourage anyone who has a chance to go down under and visit. The strength of the community was apparent when two of the three major annual awards were given for database work. Heng Tao Shen from the University of Queensland received the Chris Wallace Award. This is the top prize given for technical achievements across all fields of computer science (full professors are not eligible for this prize). Heng Tao made his mark spanning the fields of databases and multi-media.

The second award was the Ph.D. Thesis Award, which went to Michael Cahill, who received his Ph.D. from the University of Sydney under the guidance of my friend (and excellent cook!) Alan Fekete. Michael and Alan also received the Best Paper Award at SIGMOD 2008 for this work on serializable isolation for snapshot databases.

I was very fortunate to spend time with these winners. Heng Tao was my wonderful host in Brisbane and helped make a long-time dream come true -- sitting on a sunny beach in the middle of January (in Gold Coast). When I went to Sydney, Alan took me to an espresso machine factory, where I got to see up close how these machines are made!

The coffee in Australia is amazing, and will be the subject of a different post. But if you're going to Australia and need coffee, check out my list of favorite cafes and you'll be happy.

Monday, June 29, 2009

Fusion Tables: The third piece of the puzzle

When I joined Google in 2005, the goal of my group was to explore the different aspects of structured data and the Web. The first and most burning need was to address the deep web, the collection of databases stored behind forms and invisible to search engines. We developed a completely automated system that has crawled millions of forms in over 50 languages and hundreds of domains. The system surfaces pages from the deep web by guessing good queries that can be posed on the forms, and inserting the resulting HTML pages into the Google index. These pages are shown in the top-10 results for over 1000 queries per second. For all the details, see the VLDB 2008 paper by Madhavan et al.

In a second project, we explored the collection of tables that are already on the surface web. We found over 150 million high-quality tables and developed a search engine for tables (see the VLDB 2008 paper by Cafarella et al. for the details). We also showed how to leverage 2.5 million table schemas that were part of this collection. This collection is now available to the research community.

On June 9th, we launched Fusion Tables, which represents the third piece of the puzzle of structured data and the web. The main goal of Fusion Tables is to make it easier for people to create, manage and share structured data on the Web. Fusion Tables is a new kind of data management system that focuses on features that enable collaboration. We started with a relatively small set of features, but we’re rapidly expanding them, keeping our users’ requests as our top priority.

You can read the official announcement of Fusion Tables, and a great example of how it is used for data collection in the domain of water. In a nutshell, Fusion Tables enables you to upload tabular data (up to 100MB per table) from spreadsheets and CSV files. You can filter and aggregate the data and visualize it in several ways, such as maps and time lines. The system will try to recognize columns that represent geographical locations and suggest appropriate visualizations.

To collaborate, you can share a table with a select set of collaborators or make it public. One of the reasons to collaborate is to enable fusing data from multiple tables, which is a simple yet powerful form of data integration. If you have a table about water resources in the countries of the world, and I have data about the incidence of malaria in various countries, we can fuse our data on the country column, and see our data side by side. Importantly, we can do this while maintaining complete control of our own data.
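To make the idea of fusing concrete, here is a minimal sketch in Python of joining two tables on a shared country column. It only illustrates the concept and is not how Fusion Tables is implemented; the function name `fuse`, the column names, and all the values in the sample tables are placeholders I made up.

```python
def fuse(left, right, key):
    """Join two tables (lists of dicts) on a shared key column.

    Rows of `left` that have no match in `right` are dropped, and when a
    column appears in both tables the value from `left` is kept.
    """
    right_by_key = {row[key]: row for row in right}
    fused = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is None:
            continue
        merged = dict(row)
        for column, value in match.items():
            merged.setdefault(column, value)
        fused.append(merged)
    return fused


# Placeholder data, not real statistics.
water = [
    {"country": "A", "water_resources": 100},
    {"country": "B", "water_resources": 200},
]
malaria = [
    {"country": "B", "malaria_incidence": 5},
    {"country": "C", "malaria_incidence": 7},
]

for row in fuse(water, malaria, "country"):
    print(row)
```

Neither input table is modified: the fused view is built from copies of the rows, which mirrors the point that each owner keeps control of their own data.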

Collaboration is not only about integration. Once the data is visible side by side, we may want to discuss it to understand it better or resolve conflicts. With Fusion Tables you can discuss data at multiple levels of granularity: rows, columns and individual cells. Hence, the data and the discussions are deeply integrated (or should I say, fused?).

Given our focus on collaboration, there are a lot of things we do not do (and we're pretty honest about it!). We do not support complex SQL queries or high throughput transactions. Despite our love for query optimization, we’ve implemented very little of it in the current system. We will, of course, add to these capabilities with time, but our real goal here is to explore data management for a broader audience of users and needs.

Please try it out and send us feedback! Our top priority now is to respond to our users' needs.

Monday, March 9, 2009

Coffee: a Competitive Sport

I went up to Portland, Oregon last week to attend the 2009 United States Barista Championship. No, I was not competing, and unfortunately, not one of the judges either.
You can see all my pictures here.
Portland, by the way, has the largest number of cafes per capita in the US and has a very active and sophisticated coffee scene (I'm sure you appreciate how hard it is for an ex-Seattle resident to admit this). If you're in town, check out Stumptown Coffee.

The competitors came from all over the country, including the expected share of west-coast baristas and even a guy putting all the charm of a cowboy into his espresso drinks. There were quite a few baristas from Intelligentsia Coffee & Tea, including 4 out of the 6 finalists and the champion, Mike Phillips.

So how do these folks compete? They basically put on a show for 15 minutes, which initially can be quite deceiving because they look incredibly relaxed. The show includes their choice of background music and often some accents in their clothing. In those 15 minutes, they need to prepare espressos, cappuccinos and their "signature drink". They prepare 4 of each, one for each of the sensory judges.

All this while, the competitors need to show deep knowledge of their coffee, beginning by explaining their choice of blend, and how each of the flavors comes out in the drink. As they prepare the drinks, they are closely watched by a couple of technical judges, who are watching for every little detail of handling the espresso machine, waste management and timing. Multiple video cameras are following them very closely as they do this, and every now and then the emcee will elicit a cheer from the crowd (let's have it for Mike's first 2 espressos!). If you want to see an example of a wonderful performance, watch the performance of Stephen Morrissey from Ireland when he won the 2008 World Championship.

It was a fascinating crowd from all walks of the coffee industry. There were many spectators in the bleachers, some of them huge coffee fans and others who wondered how exactly they got there but were having a great time anyway. And of course, there was an amazing buzz on the floor around each of the competitors' bars -- after all, everyone in the room was caffeinated...

Finally, the awesomeness of the experience came out most poignantly when I was having a conversation with one of the other attendees and I mentioned to him that I work for Google. He asked: what part of Google do you work for? Food services?