Tuesday, July 24, 2012

Map-Reduce Gamification

Why do we need a Map-Reduce Game?

Teaching how to write Map-Reduce jobs is not an easy task. You can focus on the concepts, the framework, the benefits, the problems, or any other aspect of this paradigm. But if you want to give an overview that covers most of these aspects at once, you need to think differently.
Certainly not every developer should know when and how to write Map-Reduce tasks, and certainly not every company has the need and the ability to run them. Not every company is Google, nor needs to process Big Data at a large scale. But as with many technologies, learning it opens up a new set of opportunities and leads to the creation of a new set of capabilities.
The availability of FOSS such as Hadoop from Apache or HortonWorks, or of on-demand node farms from AWS Elastic Map-Reduce, lowers the barrier to entry for many more developers and companies.
The main problem remains the way developers and companies think about their problems. Since Map-Reduce is a different way of thinking about problems (and solving them), it requires an effort to learn before it can be adopted.
One way to solve this problem is to set up a small dedicated team that learns "Big Data" in places like IBM's Big Data University or Yahoo's HortonWorks seminars. Companies can also rely on enthusiastic developers who watch online lectures, like those at Google Code University. But these small, non-committal steps are not going to lead a company to the scale of Map-Reduce adoption seen at Google. And why should companies spend more effort on such a technology, which looks much more complex than the already complicated technologies currently in use? A classic chicken-and-egg scenario.

What is the Map-Reduce Game?

I wanted a game that would involve all the people in our R&D department. The game should demonstrate the main aspects of the technology, but in a light way. Most introduction slides on the subject are too technical and alienate much of the audience.
The game is designed as a competition between two (or more) groups. Each team consists of a few people, where each person is a processing node for Map-Reduce jobs. The teams are given the task of calculating some maximum value from a long list of values, within a limited period of time.
Before the start of the game, the teams were given an example of the input data and were told that they had a few minutes to decide how they planned to perform the task. Then each member was sent to his or her desk in a different room to work on the task according to the plan. They could only communicate through IM, and only a single team member was allowed to walk around the various rooms.
The idea was to allow the team to "write" the Map-Reduce algorithm, and then to execute it. During the execution, they could modify the algorithm, using the "runner", as they discovered optimization opportunities. They had to resolve a "server crash", when I told a team member from each team to stop working for a few minutes. They also needed to handle different processing speeds, as some team members were much faster at the task than others. In short, during the game they discovered many of the aspects of a real-life Map-Reduce process.

Detailed Game Rules

I chose the example of calculating maximum temperature, similar to the example task in the book "Hadoop in Action" by Chuck Lam. The idea was to encourage the game's graduates to read further on this topic and to find the example familiar. I used a lighter version of the data, to allow human processing: a monthly summary for a single location. You can see the example input data in this Google Spreadsheet, which I printed out for the two teams.
The task was to find, for each year from 1966 to 2010, the month with the widest range between the lowest and highest temperatures. See the example below:
Example of the input data for the Map-Reduce game
Each "node" received part of the printed pages to work on. Some of the years were cut between pages and the "Reducer" should be able to integrate the result of such year from couple of "nodes".
After a short introduction about "Map-Reduce" with the classical "word count" example, a sample input page was presented to the teams, and the game rules were explained.
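
For completeness, the word-count job mentioned above fits in a few lines in the same hypothetical streaming style (a sketch, not the exact code from the introduction slides):

    import sys

    def wc_mapper():
        # Emit every word with a count of 1.
        for line in sys.stdin:
            for word in line.split():
                print("%s\t1" % word)

    def wc_reducer():
        # The input arrives sorted by word, so counts for the same
        # word are adjacent and can be summed on the fly.
        current, count = None, 0
        for line in sys.stdin:
            word, n = line.rsplit("\t", 1)
            if word != current:
                if current is not None:
                    print("%s\t%d" % (current, count))
                current, count = word, 0
            count += int(n)
        if current is not None:
            print("%s\t%d" % (current, count))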
Then the teams received the first input page for their five-minute planning discussion, after which every team member but the "runner" went to their desks. The "runner" received the rest of the pages and started distributing them among the team members. Only the runner was allowed to physically move between the desks, distributing and collecting pages.
Each team had 15 minutes to execute the task before it had to report its results.
The score was the number of correct results across the years. No points were deducted for wrong answers, to encourage approximation and optimization techniques.

Highlights from the Game

The teams had different strategies, and they changed them during the game. One team had the "runner" distribute all the pages (2-3 pages per node) in the first round, while the second team gave only a single sheet to each member; when a sheet was done, the next one was distributed.
The "runner" walked around the rooms and try to get "progress indicator" from each node. They've noticed that each node had a different processing speed. Some of them used a calculator to calculate the difference between the high and low temperatures, and some tried to calculate with a pen and some simply in their heads.
In the debrief after the game we discussed these tactics and observations, and explored different strategies to solve this problem: trying to maintain a homogeneous environment of "nodes", sending more data or tasks to the stronger "nodes", or sending the same task to several "nodes", using the fastest one and canceling the rest (see "Speculative Execution").
While the "nodes" performed their task, they've noticed a pattern in the data; the main differences in the temperature were in the spring and autumn and not in the summer nor winter. To speed things up, some of the "nodes" decided to ignore July and August. The "runners" picked up on this optimization and distributed the optimization between the rest of the "nodes". They even discussed if they can ignore more months on the calculations. One of the team decided to focus on only 4 months (March, April, October and November).
When we discussed this observation in the debrief after the game, the concept of iteration required to write Map-Reduce jobs became clearer. The ability to run a job on a smaller set of the data, check the performance and the results, optimize the job to handle the bigger scale of data, and rerun it, is essential to a "good" Map-Reduce job. "Good" in terms of the quality of the output and the time needed to run it.
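
To follow the same loop at a keyboard, a small throwaway harness that pipes a sample through the hypothetical mapper and reducer sketched earlier is enough; it emulates the mapper-sort-reducer contract in memory before you scale up to a cluster:

    import io
    import sys

    def run_local(sample, mapper, reducer):
        # Emulate "mapper | sort | reducer" on an in-memory sample,
        # the same contract Hadoop Streaming provides at scale.
        def capture(step, text):
            stdin, stdout = sys.stdin, sys.stdout
            sys.stdin, sys.stdout = io.StringIO(text), io.StringIO()
            try:
                step()
                return sys.stdout.getvalue()
            finally:
                sys.stdin, sys.stdout = stdin, stdout

        mapped = capture(mapper, sample)
        shuffled = "".join(sorted(mapped.splitlines(True)))  # the "sort" stage
        return capture(reducer, shuffled)

    # Two made-up rows, just to check the plumbing before scaling up.
    sample = "1966\t3\t2.0\t18.5\n1966\t7\t19.0\t31.0\n"
    print(run_local(sample, mapper, reducer))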

Quick Recap

The feedback that I got from the participants indicated that the Map-Reduce game is a simple and entertaining way to introduce the concepts of Map-Reduce to a large team of developers before diving into the technical details. It gives a good sense of the simplicity of the individual tasks and the complexity of the infrastructure needed to run them. It is a good and simple first step into this new way of engineering large-scale programs.