Saturday, 10 November 2012

Hadoop tutorial - MapReduce by example


Hadoop User/Business case study
We know Hadoop has HDFS (file system), MapReduce framework but this Blog will try to explain how these two individual components come together and solves business problems

Common problem faced by organizations across domains is processing BIG DATA.
What is BIG DATA?
BIG DATA is a relative term i.e. for some start-up company(say CompUS) 2GB data is BIG DATA because there machine spec reads somewhat like Dual Core 2.70GHz,8M cache.
CompUS (company in reference) has to buy bigger machine to process 2GB of data every day.

Imagine CompUS grows and needs to process 4GB/day after 2 months, 10GB/day after 6 months and 1TB/day after 1year. To solve this, CompUS buys new and powerfull machine every time there data increases. That means company tried to solve problem with vertical scaling.  
Vertical scaling is not good because every time company buys new machine/hardware, its adding infrastructure/maintenance cost to company.

Hadoop solves problem by horizontal scaling, how?
Hadoop runs processing on parallel machines and every time data increases, companies can add one more machine to already existing fleet of machines. Hadoop processing doesn’t depend on spec of individual machines i.e. new machine added doesn’t have to be powerful, expensive machine but can be somewhat like commodity machine.

If CompUS uses Hadoop over vertical scaling it gains following
            1)   Low Infrastructure cost.
            2)  System that can be easily scaled in future without effecting existing infrastructure.

Let’s dig into User/Business cases with Hadoop in perspective.
Financial domain:
Financial firms dealing with Stock/equity trading generates huge data every day.

Problem set:
Data in hand contains information of various traders, brokers, orders etc. and company wants to generate metrics for each trader.

Infrastructure in hand:
11 machines running in parallel under Hadoop and are controlled by 1 NameNode (Hadoop Master Node, controlling node) and remaining 10 machine acts as Slaves, gets commands from Master Node


Solution-Hadoop way:
1)      Import data into HDFS (Hadoop file system)
a.      HDFS will split data across all machines in small-small blocks i.e. 10 GB of data will get divided among 10 machines and each machine gets 1GB of data.
b.      HDFS also replicate each block of data across multiple machines to handle machine failure at runtime.
c.       At this point, data is simply divided into smaller blocks that means information of “trader A” can be on machine 1/2/3… and same for all other traders.
2)      Write MapReduce function for analysis
3)      Map function
1)      Map function will create “Key/value” pair somewhat like
a.      If Original data:
                                                              i.      “trader A”, “buy prices”, “sell price”
                                                            ii.      “trader B”, “buy price”, “sell price”
                                                          iii.      “trader C”, “buy price”, “sell price”
b.      Key-value pair generated
                                                              i.      Key: “trader A” – Value: “buy price”, “sell price”
                                                            ii.      Key: “trader B” – Value: “buy price”, “sell price”
                                                          iii.      Key: “trader C” – Value: “buy price”, “sell price”  
c.       Map function will run in parallel on each of 10 machines containing data
d.      Each machine running Map function, will generate intermediate result of key/value pair
e.      Note, at this point, each machine has only created key/value pair from 1GB of data allocated earlier and each machine contains multiple distinct keys at this point.
4)      Hadoop magic before running Reduce function
a.      Hadoop(Master Node) will wait until all machine complete running Map function
b.      Hadoop will now sort and scuffle data (very important point)
                                                              i.      Sort data on individual machine wrt Key
                                                            ii.      Scuffle data across multiple machines wrt Key
                                                          iii.      Data for “Trader A” from machine1/2/3… will be consolidated on say machine1
                                                           iv.       Data for “Trader B” from machine 1/2/3… will be consolidated on machine 2 and so on
c.       At this point, all data required for analysis for “Trader A” is available on 1 machine and same for other Traders
5)      After doing sorting/scuffling of data across nodes, Master Node will ask all slaves machine to run Reduce function in parallel.
6)      Reduce function is main logic that will generate final metric for “Trader A”
7)      Say we want to analyze total profit made by each Trader?
8)      Time to run Reduce function
a.      Analysis will be easy and fast because each machine only gets required data for 1 trader.
b.      Reduce function get list of key/value pair
c.       key is same for individual function, each machine runs 1 function at one time, multiple functions in parallel on each slave machine
d.      Iterate over list  (Reduce function)
                                                              i.      Total profit = 0;
For each (key/value pair){
total profit = total profit + (buy price – sell price)
}
9)      Once all Reduce functions are completed on all slave machines, Master Node aggregates data and stores output in HDFS file system
10)  Consolidated after MapReduce will read somewhat like
a.      Trader              Total profit
b.      Trader A          10,000
c.       Trader B             -9,000
d.      Trader C          12,000

  

No comments:

Post a Comment