After explaining what Hadoop is in a previous post, here I am going to focus on how Hadoop actually works, keeping it in plain, simple terminology.
As I previously discussed, Hadoop is made up of HDFS and MapReduce. HDFS distributes the data and then MapReduce does the actual “work”. Those of you that know me will know that I enjoy cooking, so I’ve decided to explain how MapReduce works using cooking as an example. It might be slightly amusing, but I’m hoping that it will help put a relatively complex algorithm into simple terms 🙂
OK, so I’m going to bake a chocolate cake… (a cut-down explanation so I don’t bore you all). I will whisk some eggs, sieve some flour, weigh some butter and melt some chocolate, and then mix it all together and bake! Now suppose I wanted to make a fruit cake instead? This time I will chop up pieces of fruit instead of melting the chocolate (this could go on with lots of different cakes!)
So what does cake baking have to do with MapReduce?
Let’s apply my recipe for baking to this. Map and Reduce are two operations:
Map: Whisking eggs, sieving flour and weighing butter are operations. A Map operation can be applied to each of these individually, so you pass some eggs to a map and it will whisk them, you pass some flour to a map and it will sieve it, etc.
Reduce: This is the phase where you put the output of the maps together and bake to form the cake 🙂 It means that we have reduced all of the ingredients to produce one output (the reducer is aggregating the output of the maps).
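For anyone who prefers code to cake, here is a rough Python sketch of those two operations using the built-in `map` and `functools.reduce`. The names `PREP`, `prepare` and `combine` are made up for this illustration, and this is plain single-machine Python rather than Hadoop itself.

```python
from functools import reduce

# Hypothetical lookup: what has to be done to each ingredient.
PREP = {"eggs": "whisked", "flour": "sieved", "butter": "weighed", "chocolate": "melted"}

def prepare(ingredient):
    """Map step: process one ingredient on its own."""
    return f"{PREP[ingredient]} {ingredient}"

def combine(mixture, prepared_item):
    """Reduce step: fold one prepared ingredient into the mixture."""
    return mixture + [prepared_item]

ingredients = ["eggs", "flour", "butter", "chocolate"]

# Map: every ingredient is prepared independently of the others.
prepared = list(map(prepare, ingredients))

# Reduce: all prepared ingredients are combined into a single output.
mixture = reduce(combine, prepared, [])
cake = "cake made from " + ", ".join(mixture)
print(cake)
```

The point to notice is that `prepare` never needs to know about the other ingredients, which is what makes it possible to hand each one to a different worker.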
Now let’s suppose that I enter one of my cakes into a competition and win first prize, where I get funding to start my own baking company (you never know!!). This means I now need to produce 2000 cakes a day (4 different types!!). I will certainly have to hire more people and buy more ovens! Now each person will correspond to a map, and each person will process a single ingredient at a time. Once all of the workers are done, I will have all of my ingredients prepared.
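To make the "each person is a map" idea a bit more concrete, here is a small Python sketch in which a thread pool stands in for my team of workers. That is an assumption made purely for illustration: Hadoop runs its map tasks across different machines, not threads. The function `whisk` and the list of orders are also invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def whisk(order):
    """One worker's task: whisk the eggs needed for a single cake order."""
    cake_type, egg_count = order
    return (cake_type, f"{egg_count} whisked egg(s)")

# Made-up orders: (type of cake, number of eggs it needs).
orders = [("chocolate", 1), ("lemon", 2), ("chocolate", 1), ("lemon", 2)]

# Up to four workers pick up orders and process them independently.
with ThreadPoolExecutor(max_workers=4) as workers:
    # Executor.map returns results in the same order as the inputs.
    whisked = list(workers.map(whisk, orders))

print(whisked)
```

Hiring more people here just means raising `max_workers`; the `whisk` function itself does not change.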
Now I need to be able to create different types of cakes (chocolate, fruit, lemon etc). All the ingredients will have been prepared by my workers, but how do I actually go about creating the different types of cakes? This is the final phase of MapReduce.
MapReduce will group all of the outputs written by every map based on the key (assume the key is the type of cake). My egg-whisking worker may only need to whisk one egg for a chocolate cake but two eggs for a lemon cake, so the workers need to group their output based on the key.
Finally… across all workers the output will be grouped together by key and placed into the oven to bake the final cake!
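The grouping step can be sketched in a few lines of Python. This is only an illustration of grouping by key on one machine, assuming the cake type is the key. The helper names `group_by_key` and `bake`, and the ingredient values, are invented for the example.

```python
from collections import defaultdict

# Output from the workers (maps): (cake type, prepared ingredient) pairs.
# The values are invented for illustration.
map_output = [
    ("chocolate", "1 whisked egg"),
    ("lemon", "2 whisked eggs"),
    ("chocolate", "sieved flour"),
    ("lemon", "sieved flour"),
    ("chocolate", "melted chocolate"),
    ("lemon", "lemon zest"),
]

def group_by_key(pairs):
    """Collect every value that shares a key into one list per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

def bake(cake_type, prepared):
    """Reduce step: turn one key's grouped ingredients into one cake."""
    return f"{cake_type} cake ({', '.join(prepared)})"

grouped = group_by_key(map_output)
cakes = [bake(cake_type, prepared) for cake_type, prepared in sorted(grouped.items())]
print(cakes)
```

Each key ends up with exactly one call to `bake`, no matter how many workers contributed ingredients for that type of cake.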
So in a nutshell that is how MapReduce works… I really hope I didn’t confuse anyone and that this actually helped (please leave me your feedback!)
Now if you want the slightly more technical description…
The basic idea is that you divide the job into two parts: a Map, and a Reduce.
Map basically takes the problem, splits it into sub-parts, and sends the sub-parts to different machines (it is possible for machines to redistribute the work, leading to a multi-level structure). Each machine will process its piece of work and send its answer back up the structure.
Reduce then collects all the answers and combines them back together in some way to get a single answer to the original problem.
The key to how MapReduce does “things” is to take input (for example a list of records); the records are then split amongst the machines by the map. The map computation will produce a list of key/value pairs (basically a set of two linked data items: the key is the identifier, which can appear in more than one pair, and the value is either the data itself or a link to the data).
Reduce then collates all of the pairs. It looks for pairs that share the same key and merges them.
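A word count is a common way to illustrate key/value pairs, so here is a minimal Python sketch of it. It assumes each record is a line of text and that "merging" duplicate keys means adding up their values. The function names `map_record` and `reduce_pairs` are hypothetical and not part of any Hadoop API.

```python
from collections import defaultdict

def map_record(record):
    """Map: emit a (word, 1) pair for every word in one record."""
    return [(word.lower(), 1) for word in record.split()]

def reduce_pairs(pairs):
    """Reduce: merge pairs that share a key by summing their values."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

records = ["egg flour egg", "butter egg", "flour butter butter"]

# Run the map over every record and gather all of the pairs it emits.
pairs = []
for record in records:
    pairs.extend(map_record(record))

counts = reduce_pairs(pairs)
print(counts)
```

Note that the same key (for example `"egg"`) shows up in several pairs coming out of the map, and it is the reduce that collapses them into one entry.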
So Map takes a set of data chunks, and produces key/value pairs; reduce merges things, so that instead of a set of key/value pair sets, you get one result. You can’t tell whether the job was split into 100 pieces or 2 pieces; the end result looks pretty much like the result of a single map.
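The claim that the number of pieces does not show up in the result can be checked with a small Python sketch. It assumes a word-count style job where merging means summing counts. All of the function names (`split`, `map_chunk`, `reduce_counts`, `run_job`) and the sample records are made up for illustration.

```python
def split(records, pieces):
    """Divide the records into at most `pieces` roughly equal chunks."""
    size = max(1, -(-len(records) // pieces))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]

def map_chunk(chunk):
    """Map: count the words in one chunk, as a dict of word -> count."""
    counts = {}
    for record in chunk:
        for word in record.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

def reduce_counts(partials):
    """Reduce: merge the per-chunk dicts, summing counts for shared keys."""
    merged = {}
    for partial in partials:
        for word, count in partial.items():
            merged[word] = merged.get(word, 0) + count
    return merged

def run_job(records, pieces):
    """Split the input, map each chunk, then reduce the partial results."""
    return reduce_counts(map_chunk(chunk) for chunk in split(records, pieces))

# 100 made-up records.
records = ["egg flour", "butter egg", "flour butter", "egg egg"] * 25

two_pieces = run_job(records, 2)
hundred_pieces = run_job(records, 100)
print(two_pieces == hundred_pieces)
```

This only works because the merge (addition) gives the same total regardless of how the partial counts are grouped; a reduce that depended on chunk boundaries would not have this property.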
Thanks for reading… I hope you found this useful. In my next post I plan to write about Greenplum Hadoop and how it differs from open source Hadoop.