Tuesday, July 24, 2012

Map-Reduce Gamification

Why do we need a Map-Reduce Game?

Teaching how to write Map-Reduce jobs is not an easy task. You can focus on the concepts, the framework, the benefits, the problems or any other aspect of this paradigm. But if you want to give an overview that covers most of these aspects at once, you need to think differently.
Certainly not every developer should know when and how to write Map-Reduce tasks, and certainly not every company has the need and ability to run such tasks. Not every company is Google, nor does every company need to process Big Data at a large scale. But as with many technologies, learning it opens up a new set of opportunities and leads to the creation of a new set of capabilities.
The availability of FOSS such as Hadoop from Apache or HortonWorks, or an on-demand node farm from AWS Elastic Map-Reduce, lowers the barrier to entry for many more developers and companies.
The main problem remains the way developers and companies think about their problems. Since Map-Reduce is a different way of thinking about problems (and solving them), it requires an effort to learn before it can be adopted.
One way to solve this problem is to set up a small dedicated team that learns "Big Data" in places like IBM's Big Data University or Yahoo's HortonWorks Seminars. Companies can also rely on enthusiastic developers who watch online lectures, for example at Google Code University. But these small, non-committal steps are not going to lead a company to Google's scale of Map-Reduce adoption. And why should companies spend more effort on such a technology, which looks much more complex than the already far-from-simple technologies currently in use? A classic chicken-and-egg scenario.

What is the Map-Reduce Game?

I wanted a game that would involve all the people in our R&D department. The game should demonstrate the main aspects of the technology, but in a light way. Most introductory slides on the subject are too technical and alienate most of the audience.
The game is designed as a competition between two (or more) groups. Each team consists of a few people, where each person is a processing node for Map-Reduce jobs. The teams are given the task of calculating some maximum value from a long list of values, within a limited period of time.
Before the start of the game, the teams were given an example of the input data and were told that they had a few minutes to decide how they planned to perform the task. Then each member was sent to his desk in a different room to work on the task according to the plan. They could communicate only through IM, and only a single team member was allowed to walk around the various rooms.
The idea was to allow each team to "write" the Map-Reduce algorithm and then execute it. During the execution, they could modify the algorithm, using the "runner", as they discovered optimization opportunities. They had to resolve a "server crash" when I told a team member from each team to stop working for a few minutes. They also needed to handle different processing speeds, as some team members were much faster at the task than others. In short, during the game they discovered many of the aspects that are part of a real-life Map-Reduce process.

Detailed Game Rules

I chose to use the example of calculating maximum temperature, similar to the example task in the book "Hadoop in Action" by Chuck Lam. The idea was to encourage the game's graduates to read further on this topic and to find the example familiar. I used a lighter version of the data, a monthly summary for a single location, to allow human processing. You can see the example input data in this Google Spreadsheet, which I printed out for the two teams.
The task was to find, for each year from 1966 to 2010, the month with the widest range between the lowest and highest temperatures. See the example below:
Example of the input data for the Map-Reduce game
Each "node" received part of the printed pages to work on. Some of the years were cut between pages and the "Reducer" should be able to integrate the result of such year from couple of "nodes".
After a short introduction about "Map-Reduce" with the classical "word count" example, a sample input page was presented to the teams, and the game rules were explained.
Then, the teams received the first input page for their 5 minutes planning discussion, and then each team member, but the "runner" went to their desks. The "runner" received the rest of the pages and started distributing the pages among the team members. Only the runner was allowed to physically move between the desks, distributing and collecting the pages.
The team had 15 minutes to execute their tasks, before they needed to report their results.
The score was the number of correct results for each of the years. No points were deducted for wrong answers to encourage approximation and optimization techniques.
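
To connect the game back to the real framework, here is a minimal sketch of how the same task could look as a Hadoop job, written against the org.apache.hadoop.mapreduce API. The input line format "year,month,minTemp,maxTemp" and the class names are my assumptions for illustration; they are not the actual spreadsheet layout used in the game.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WidestMonthlyRange {

    // Map: parse one monthly record and emit (year, "month<TAB>range").
    public static class RangeMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length < 4) {
                return; // skip malformed lines
            }
            String year = fields[0].trim();
            String month = fields[1].trim();
            int range = Integer.parseInt(fields[3].trim())
                      - Integer.parseInt(fields[2].trim());
            context.write(new Text(year), new Text(month + "\t" + range));
        }
    }

    // Reduce: for each year, keep the month with the widest range.
    public static class MaxRangeReducer
            extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text year, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String bestMonth = null;
            int bestRange = Integer.MIN_VALUE;
            for (Text value : values) {
                String[] parts = value.toString().split("\t");
                int range = Integer.parseInt(parts[1]);
                if (range > bestRange) {
                    bestRange = range;
                    bestMonth = parts[0];
                }
            }
            context.write(year, new Text(bestMonth + " (" + bestRange + ")"));
        }
    }
}

In the game, each human "node" plays the RangeMapper role for its pages, and the person collecting partial results plays the MaxRangeReducer.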

Highlights from the Game

The teams had different strategies, and they changed them during the game. One team had the "runner" distribute all the pages (2-3 pages per node) in the first round, while the second team gave only a single sheet to each member and distributed the next sheet once processing of the previous one was complete.
The "runners" walked around the rooms and tried to get a "progress indicator" from each node. They noticed that each node had a different processing speed: some used a calculator to compute the difference between the high and low temperatures, some worked it out with a pen, and some did it simply in their heads.
In the game debrief we discussed these tactics and observations, and considered different strategies to address this problem: maintaining a homogeneous environment of "nodes", sending more data or tasks to stronger "nodes", or sending the same tasks to different "nodes", using the fastest one and canceling the rest (see "Speculative Execution").
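
The real framework offers that last trick out of the box. A minimal sketch, assuming the Hadoop 2.x property names mapreduce.map.speculative and mapreduce.reduce.speculative (older versions use different names), of turning speculative execution on for a job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeJobSetup {
    // Returns a job configured to launch duplicate attempts of slow tasks
    // and keep the result of whichever attempt finishes first.
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);
        return Job.getInstance(conf, "max-temperature-range");
    }
}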
While the "nodes" performed their task, they've noticed a pattern in the data; the main differences in the temperature were in the spring and autumn and not in the summer nor winter. To speed things up, some of the "nodes" decided to ignore July and August. The "runners" picked up on this optimization and distributed the optimization between the rest of the "nodes". They even discussed if they can ignore more months on the calculations. One of the team decided to focus on only 4 months (March, April, October and November).
When we discussed this observation in the debrief after the game, the concept of iteration required to write Map-Reduce jobs was clearer. The ability to run a job on a smaller set of the data, check the performance, check the results and optimize the job to handle the bigger scale of data, rerun it on the data, is essential to run a "good" Map-Reduce job. "Good" in terms of quality of the output and the time needed to run it. 
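
As a purely illustrative variant, the teams' "skip the summer months" shortcut could be expressed as a small filter that the mapper sketched earlier consults before emitting a record; the helper class and month names below are my own assumptions, not part of the game material.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class MonthFilter {

    // Months the teams chose to ignore as an approximation; illustrative only.
    private static final Set<String> SKIPPED =
            new HashSet<>(Arrays.asList("July", "August"));

    // Returns true when a record is worth processing, false when it can be
    // skipped under the "summer months rarely win" observation from the game.
    public static boolean shouldProcess(String month) {
        return !SKIPPED.contains(month);
    }
}

Dropping such a filter into the map step, measuring its effect on a small sample, and only then rerunning on the full data is exactly the iteration loop described above.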

Quick Recap

The feedback I got from the participants indicated that the Map-Reduce game is a simple and entertaining way to introduce the concepts of Map-Reduce to a large team of developers before diving into the technical details. It gives a good sense of the simplicity of the individual tasks and the complexity of the infrastructure needed to run them. It is a good and simple first step into this new way of engineering large-scale programs.

Monday, March 5, 2012

Scalability and Agile Development

How can you scale a module that you are building with an agile method? Should you turn to Agile Development to be able to scale your services better?
Agile development has many of the components that are critical for being able to scale when you need to.
I added my version of the Scalability chart to the Version One diagram about Agile Benefits:

Agile Development gives you the ability to develop your services in shorter development cycles toward the right scale target.
To be able to scale in an agile way, you need to practice the basics of agile development: Tests, Refactoring, and Continuous Integration and Deployment. I will explain why.

Tests

The heart of every agile development process is the practice of automated testing. It can be called Test Driven Development, Behavior Driven Development or Design by Contract. It can involve sophisticated system tests or simple client tests for your services using Selenium or WebDriver. It should give you the feeling that you are protected from changes that might break your services. The tests work against your API definition. The API can be at the class level, allowing you to use it from other classes, or it can cover your whole service, with authentication and REST calls. The more tests you have, the more agile you can be.
These tests are used to check that you are delivering the needed functionality, and can also be used to check that you are delivering the needed non-functional requirements (Scalability, High Availability...). These tests are the trigger for identifying when and what to scale, and should later be used to check that the new code did the trick.
When you are protected by tests, you can change your implementation and verify that you didn't break the needed functionality; and if you did break something, you discover it very early.
The agile way of developing your tests and services is as follows (a minimal test sketch follows the list):
  • Write Functional Tests
  • Write Simple Implementation to satisfy the tests
  • Add (or extend) Functional Tests
  • Refactor your implementation to satisfy the new tests as well
  • Add (or extend) Non-Functional Tests
  • Re-Architect your implementation to satisfy the non-functional tests as well
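
Here is a minimal sketch of that progression, assuming a hypothetical OrderService with a findOrder(String id) call that returns an object with getId(), and JUnit 4; the service, the method names and the 200 ms budget are all illustrative.

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import org.junit.Test;

public class OrderServiceTest {

    // Functional test: the service returns the expected order for a known id.
    @Test
    public void returnsTheRequestedOrder() {
        OrderService service = new OrderService();
        assertEquals("order-42", service.findOrder("42").getId());
    }

    // Non-functional test: the same call stays within a response-time budget,
    // so a later re-architecture for scale is checked by the same suite.
    @Test
    public void answersWithinResponseTimeBudget() {
        OrderService service = new OrderService();
        long start = System.currentTimeMillis();
        service.findOrder("42");
        long elapsed = System.currentTimeMillis() - start;
        assertTrue("took " + elapsed + " ms", elapsed < 200);
    }
}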

Refactoring

If you are already practicing refactoring ('a series of small behavior preserving transformations. Each transformation (called a 'refactoring') does little, but a sequence of transformations can produce a significant restructuring'), you are in the right mindset for breaking out your smelly modules.
Refactoring thinking is very similar to the thinking needed for the right scaling decisions. Most of the changes you make provide no functional benefit; they just reduce the bad smells in your code. The same reasoning applies to scalability smells, which are usually of a different order than code smells, as they are system smells. I discussed scalability smells in "When to Scale".
There are early scalability patterns among the refactoring patterns. You can use, for example, the Extract Package pattern to start your journey to "divide and conquer" your system. But there is a need for a new set of patterns that allow more significant changes in a system, beyond the Java-only, single-system scope covered by the refactoring patterns.
I think it is worth learning from one of the recent refactoring patterns, Split Loop, added by Martin Fowler:
"You often see loops that are doing two different things at once, because they can do that with one pass through a loop. Indeed most programmers would feel very uncomfortable with this refactoring as it forces you to execute the loop twice - which is double the work.
But like so many optimizations, doing two different things in one loop is less clear than doing them separately. It also causes problems for further refactoring as it introduces temps that get in the way of further refactorings. So while refactoring, don't be afraid to get rid of the loop. When you optimize, if the loop is slow that will show up and it would be right to slam the loops back together at that point. You may be surprised at how often the loop isn't a bottleneck, or how the later refactorings open up another, more powerful, optimization."
I believe the same reasoning applies to systems that are doing two different things at once. Trying to optimize the services by merging them into a single machine/DB/WAR file... is sometimes the wrong way. Splitting, in many cases, can open up another, more powerful, optimization.
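
To make the quoted pattern concrete, here is a small, illustrative Java example of the split loop idea (the names are mine, not from Fowler's catalog): one loop computing a total and a maximum becomes two loops, each with a single responsibility.

public class SplitLoopExample {

    // Before: a single pass computes both the total and the maximum.
    static int[] totalAndMaxCombined(int[] values) {
        int total = 0;
        int max = Integer.MIN_VALUE;
        for (int value : values) {
            total += value;
            if (value > max) {
                max = value;
            }
        }
        return new int[] { total, max };
    }

    // After: each concern gets its own loop, which is clearer and makes it
    // easy to extract each loop into its own method or, by analogy, its own service.
    static int total(int[] values) {
        int total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    static int max(int[] values) {
        int max = Integer.MIN_VALUE;
        for (int value : values) {
            max = Math.max(max, value);
        }
        return max;
    }
}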

DevOps

The concepts of DevOps and Continuous Deployment are advanced concepts in the agile development world. After putting the task of writing tests on the shoulders of developers ("Am I a tester?", "I don't have time to write my code AND the tests..."), we are now adding the tasks of deployment and operation.
These are certainly not easy concepts to implement, especially if you are not starting with a blank page. You can read, for example, an interesting description of introducing Continuous Deployment at Outbrain. Even for star developers like the ones at Outbrain it was a real challenge, so how can we expect "the rest of us" to be able to do it?
Here as well, the ability to split your system into independent services, each with its own release cycle, makes it much easier to apply DevOps techniques. It will help you to:

  • Code in small batches - if the system is too big, even refactoring causes changes in too many places in your code, forcing you to deploy (build, test, copy...) too many changes each time
  • Keep the service's MVP (Minimum Viable Product) properties - it is easier to experiment (A/B testing) with a smaller service, and to keep it focused
  • Maintain functional test coverage - an independent service is easier to test independently
  • Maintain monitoring and alerts - faster problem detection (and prevention)
  • Roll forward fixes - if the service is small, you can fix issues with actual enhancements instead of rolling back to the last working version, as most operations teams do

The Scalability Paradox

Instead of having a small team of expensive developers and large teams of less expensive testers and system operators, we are reducing the size of the less expensive teams and increasing the size of (or at least the load on) the more expensive developer team. This strange economic math holds true in the world of scalability.

Friday, February 24, 2012

How to Scale? Distributed Architecture

Distributed *

After the success of the revolutions of Distributed Data and Distributed Computing, it is time to think about Distributed Architecture.
I believe you can find many similarities between the way data processing has evolved over the years and how system architecture is evolving and can continue to evolve.
60 years of Database Evolution
I extended the database-evolution diagram to include the addition of the NoSQL paradigm, which emerged after it had seemed almost certain, in the early 2000s, that the RDBMS was the answer to every data problem around. The big RDBMS vendors tried to scale their databases down to be embedded in mobile devices, and at the same time tried to scale them up to handle huge amounts of data. It is now clear that there is no single best data solution for every problem. The days of "buying an Oracle DB and building every solution on top of it" are gone.
I would argue that the same is true for the architecture of most large systems; the days of "we are a Java shop, therefore every solution is based on OSGi and J2EE" are gone as well.
You may argue that OSGi is built exactly to provide the kind of modularity we need. But I think it is better not to look at problems (once we have proved it is time to re-architect them for scale) as nails just because we have a hammer, or OSGi bundles in this case. Each scalability issue should be solved with the best tool for the problem.
The main incentive for the evolution of data processing, large-scale computing and new architecture designs is the changing environment, mainly the explosion of data, users, services and markets. The existing technologies were unable to scale fast enough. It is impossible to build a computer big enough to compute planet-scale problems fast enough. It is impossible to build a single relational database big enough to store and retrieve Big Data fast enough. Along the same lines, it is impossible to design a single-technology architecture that scales flexibly enough for your evolving business.

Distributed Scale

In the same way that NoSQL databases require different thinking from the DBA, and MapReduce requires different thinking from developers, software architects should think differently about scalability issues.
Distributed Architecture for Scale
The goal of a distributed architecture is to allow linear scalability of the system along any of the axes of growth.

  • Users - if you have double the users, you can simply double the boxes.
  • Services - if you want to add additional services, you can simply add their boxes independently of the other services.
  • Markets - when you enter a new market, geographically or along another dimension, you would like to do it independently of the existing ones.
If you have one big WAR file deployed on your JBoss server, with a single MS-SQL DB behind it, there is little chance you will be able to scale linearly (if at all), no matter what architectural and tuning efforts you make. Many optimization efforts try to move less critical services (like reporting) away from the main system and database to decrease the load. It is important to understand that I recommend moving the most critical services away from the system or database and architecting them in a more focused and scalable way. This way you will get a better ROI on your scalability efforts.
The diagram above was adapted from Christian Timmerer's post to illustrate the required modularity. 
Once you have decided that it is time to scale, start breaking your system down into smaller independent services, each with its own API and its own data store (the "Share Nothing" concept). The data store can be based on an RDBMS, but more likely you will find a more suitable data solution now that you only need to solve the data issues of a more focused service. Maybe a simple hash table in memory can do everything that is needed, or one of the many ready-made data solutions that were developed for similar services.
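
As a minimal sketch of that idea, here is a tiny "share nothing" service whose entire data store is an in-memory hash table; the service name and methods are illustrative, not a prescribed design.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SessionStoreService {

    // The service's private data store: no other service touches it directly.
    private final Map<String, String> sessions = new ConcurrentHashMap<>();

    // The service's API: everything else goes through these calls, so the
    // store behind them can later be swapped without touching the callers.
    public void put(String sessionId, String userId) {
        sessions.put(sessionId, userId);
    }

    public String get(String sessionId) {
        return sessions.get(sessionId);
    }

    public void remove(String sessionId) {
        sessions.remove(sessionId);
    }
}

If this focused store ever becomes the bottleneck, only this one service needs to move to a dedicated key-value solution.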
When the service is narrower and more focused, it is easier to identify the right scalability pattern and to choose the right set of tools to implement it.
This is certainly not a new idea, as it is widely discussed as SOA (Service Oriented Architecture), but SOA is usually applied to external services, and less to the internal breakdown of a single system. It seems that the overhead of SOA, compared to a simple function call or an update to the same RDBMS, is too high for many cases. But it is the easiest option if you wish to scale different aspects of your system (function- or data-wise) differently.

Wednesday, February 22, 2012

When to Scale?

You know that you need to make significant changes to be able to scale sometime in the future.
Your service is serving thousands of users and the number is growing. You also have plans for wonderful features that will only bring more and more users to your service. Your business development people are also telling you about a great partner (Microsoft/Google/Twitter...) who is going to sign a partnership with you and multiply your user base at least tenfold. The future is bright, if only you could scale your service in time.
Theoretical measurement of module cost/performance

Don't Scale too early

Why couldn't you build the system to scale in the first place?
This is usually the delicate balance between "building it right" and "building the right 'it'" (if I may borrow the terms from the Pretotyping Manifesto). You could either spend your time laying the groundwork for possible future scale, risking over-architecture, or you could focus on testing your ideas in a "quick-and-dirty" way. If you are successful, I can bet that you used the latter; therefore, you need to scale now.
If you scaled too early, the risk is that you spent too much time ("Initial Cost" in the diagram above) and now have to wait longer and be much more successful to reach the "Break Even" point. I see many startup companies coming to raise VC funding "to finish the development", while VCs prefer to put their money into "boosting marketing and sales".
Even if you could build the system to scale from the beginning, you shouldn't!

Don't Scale too late

Why couldn't you scale the service at the right time?
If you take a look at the diagram above, you can see that picking the right time to work on scale is not simple. At the beginning of the "Change Decision Range", your system is performing well and scales almost linearly. The more effort you put into the system (software development, performance tuning, hardware additions...), the better the results you receive. This is certainly not the time to spend on rewriting your system for scale, or is it?
You do start to see signs of problems: some features take too much time to implement, some errors cause the system to fail, some upgrades do not go as smoothly as they used to. But everything is more or less manageable, and these problems are resolved rather quickly. These are system smells, if I may borrow the concept of the code smell from Martin Fowler.

The following facts cause you to make the scaling decision too late:
  • The slow, gradual degradation of your architecture builds your confidence that you are still on the right track, as you are getting better at solving these problems
  • The boost in the business is putting more pressure on "keeping the lights on"
  • The focus on "keeping the lights on" is preventing you from working on deep rewrites
  • The slow gradual trend change from linear growth to plateau and to linear decline is difficult to detect
  • The more you invest in the current technology, the harder it gets to decide to switch to a new technology. This is true for managers ("How can I come and say that I was wrong before") and for developers ("I'm a professional Java developer, why should I learn Scala?"). 
You could scale the system at the right time, but it is very difficult!

Scale Right

How can you know when it is the best time to scale?
First, measure!
Work on measuring the effort you spend (development, optimization, hardware...) vs. the benefit you receive (request numbers, response times, accuracy...).
Second, visualize!
Put your measurement in front of your eyes, and the eyes of your peers. A diagram similar to the one above should be reviewed every week.
Third, realize!
You can't fight reality, although it is very easy to ignore it. Realize that the linear growth curve will turn flat and then decline. Know that you are going to invest in a new technology, and that it will return its investment, at the "Break Even" point, only after a considerable period of time.
Fourth, decide!
When you see the signs that you have hit that point in the curve, start working on the needed technology changes and the rewrite of your well-performing system.

Tuesday, September 13, 2011

HP and the Data Processing Landscape


I came across an interesting article by Gartner on "Hadoop and MapReduce: Big Data Analytics". I took the long article and summarized it in the above diagram. I also charted on it the position of my company, HP, with regard to this new and evolving landscape, given its recent acquisitions of Vertica and Autonomy.
The main points are the extensions beyond traditional RDBMS systems, toward intelligent and personalized handling of events (Complex Event Processing - CEP) and toward handling Extreme Data for strategic decisions. Note that I'm using the term "Extreme Data" here, and not simply "Big Data", to emphasize that this new type of data handling is not only about data volume.
The spider chart at the top left shows the main differences between the "old" types of OLTP (Online Transaction Processing), ODS (Operational Data Store) and EDW (Enterprise Data Warehouse), and the "new" types of CEP and Extreme Data.
The topic of unstructured data is also very important, and the breakdown of the types and levels of unstructured data is highlighted in the table on the middle right.
Another interesting point that I would like to make with the diagram is that "the world is round": there are many similarities between the two extremes of the data processing continuum. Many of the analyses applied on the Extreme Data side should be fed into the CEP side to make event processing truly intelligent. The basic example is crawling the web and creating an index on one hand, which is then used to respond immediately to user queries on the other. The former belongs to Extreme Data and the latter to CEP, although they are strongly tied together. The strategic analysis can be used internally by the enterprise, or it can feed the real-time event processing part of the organization (call center, customer web site...).
The diagram also shows the place of Hadoop technologies in handling many of the Extreme Data challenges, and the main parts of this stack that are relevant to enterprise issues.