Hadoop User/Business Case Study
We know Hadoop has HDFS (a distributed file system) and the MapReduce framework, but this blog will try to explain how these two individual components come together to solve business problems.
A common problem faced by organizations across domains is processing BIG DATA.
What is BIG DATA?
BIG DATA is a relative term, i.e. for some start-up company (say CompUS) 2GB of data is BIG DATA because their machine spec reads somewhat like Dual Core 2.70GHz, 8M cache.
CompUS (the company in reference) has to buy a bigger machine to process 2GB of data every day.
Imagine CompUS grows and needs to process 4GB/day after 2 months, 10GB/day after 6 months, and 1TB/day after 1 year. To solve this, CompUS buys a new, more powerful machine every time its data increases. That means the company is trying to solve the problem with vertical scaling.
Vertical scaling is not good because every time the company buys a new machine or hardware, it adds infrastructure/maintenance cost to the company.
Hadoop solves this problem by horizontal scaling. How?
Hadoop runs processing on parallel machines, and every time data increases, companies can add one more machine to the already existing fleet of machines. Hadoop processing doesn't depend on the spec of individual machines, i.e. a newly added machine doesn't have to be a powerful, expensive machine but can be a commodity machine.
If CompUS uses Hadoop over vertical scaling, it gains the following:
1) Low infrastructure cost.
2) A system that can be easily scaled in the future without affecting existing infrastructure.
Let's dig into a User/Business case with Hadoop in perspective.
Financial domain:
Financial firms dealing with stock/equity trading generate huge amounts of data every day.
Problem set:
The data in hand contains information on various traders, brokers, orders, etc., and the company wants to generate metrics for each trader.
Infrastructure in hand:
11 machines running in parallel under Hadoop: 1 NameNode (the Hadoop Master Node, the controlling node) controls the cluster, and the remaining 10 machines act as Slaves, getting commands from the Master Node.
Solution, the Hadoop way:
1) Import data into HDFS (the Hadoop file system); a small import sketch follows the sub-items below.
a. HDFS will split the data across all machines in small blocks, i.e. 10GB of data will get divided among the 10 machines, with each machine getting 1GB of data.
b. HDFS also replicates each block of data across multiple machines to handle machine failure at runtime.
c. At this point, the data is simply divided into smaller blocks; that means information for “trader A” can be on machine 1/2/3… and the same goes for all other traders.
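To make step 1 concrete, here is a minimal sketch of the import using Hadoop's Java FileSystem API (the same thing can be done from the command line with hadoop fs -put). The file names and paths are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ImportTrades {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up cluster settings (core-site.xml etc.)
        FileSystem fs = FileSystem.get(conf);      // handle to HDFS
        // Copy the local trade file into HDFS; HDFS splits it into blocks
        // and replicates each block across machines automatically.
        fs.copyFromLocalFile(new Path("/local/trades.csv"),
                             new Path("/data/trades.csv"));
        fs.close();
    }
}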
2) Write the MapReduce functions for the analysis (a complete job driver sketch appears at the end of this post).
3) Map function: the Map function will create “key/value” pairs, somewhat like this:
a. If the original data is:
i. “trader A”, “buy price”, “sell price”
ii. “trader B”, “buy price”, “sell price”
iii. “trader C”, “buy price”, “sell price”
b. The key/value pairs generated are:
i. Key: “trader A” – Value: “buy price”, “sell price”
ii. Key: “trader B” – Value: “buy price”, “sell price”
iii. Key: “trader C” – Value: “buy price”, “sell price”
c. The Map function will run in parallel on each of the 10 machines containing data (a minimal Java sketch follows this step).
d. Each machine running the Map function will generate an intermediate result of key/value pairs.
e. Note: at this point, each machine has only created key/value pairs from the 1GB of data allocated to it earlier, and each machine contains multiple distinct keys.
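As a minimal Java sketch of this Map step: it assumes each input line is a comma-separated record of the form trader,buyPrice,sellPrice (the record layout and class name are illustrative, not taken from a real system):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TraderMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        // Input line assumed to look like: "trader A,100.0,110.0"
        String[] fields = line.toString().split(",");
        // Emit key = trader name, value = "buyPrice,sellPrice"
        ctx.write(new Text(fields[0]), new Text(fields[1] + "," + fields[2]));
    }
}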
4) Hadoop magic before running the Reduce function:
a. Hadoop (the Master Node) will wait until all machines complete running the Map function.
b. Hadoop will now sort and shuffle the data (a very important point):
i. Sort the data on each individual machine with respect to the key.
ii. Shuffle the data across multiple machines with respect to the key.
iii. Data for “trader A” from machines 1/2/3… will be consolidated on, say, machine 1.
iv. Data for “trader B” from machines 1/2/3… will be consolidated on machine 2, and so on.
c. At this point, all the data required for the analysis of “trader A” is available on one machine, and the same holds for the other traders (a sketch of how keys are assigned to machines follows this step).
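How does Hadoop decide which machine gets which trader? A partitioner maps each key to one of the reduce machines. The sketch below mirrors the behavior of Hadoop's built-in default (HashPartitioner); the class name is hypothetical:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class TraderPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // Same formula Hadoop's default HashPartitioner uses: hash the key
        // and take it modulo the number of reducers (10 in our cluster).
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}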
5) After sorting/shuffling the data across nodes, the Master Node will ask all slave machines to run the Reduce function in parallel.
6) The Reduce function is the main logic that will generate the final metric for each trader.
7) Say we want to compute the total profit made by each trader.
8) Time to run the Reduce function:
a. The analysis will be easy and fast because each machine only gets the data required for one trader at a time.
b. The Reduce function gets a key and the list of values for that key.
c. The key is the same within an individual function call; each machine runs one call at a time, with multiple calls running in parallel across the slave machines.
d. Iterate over the list (the Reduce function), somewhat like this (a fuller Java sketch follows):
total profit = 0;
for each (key/value pair) {
    total profit = total profit + (sell price – buy price);
}
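A fuller Java sketch of this Reduce function, matching the value format emitted by the hypothetical TraderMapper above (note that the profit per trade is the sell price minus the buy price):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TraderReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    @Override
    protected void reduce(Text trader, Iterable<Text> trades, Context ctx)
            throws IOException, InterruptedException {
        double totalProfit = 0;
        for (Text trade : trades) {
            // Each value looks like "buyPrice,sellPrice"
            String[] prices = trade.toString().split(",");
            totalProfit += Double.parseDouble(prices[1]) - Double.parseDouble(prices[0]);
        }
        // Emit one line per trader: the final metric
        ctx.write(trader, new DoubleWritable(totalProfit));
    }
}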
9) Once all Reduce functions are completed on all slave machines, the output of each reducer is written back to the HDFS file system.
10) The consolidated output after MapReduce will read somewhat like:

Trader      Total profit
Trader A    10,000
Trader B    -9,000
Trader C    12,000
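Putting it all together, a minimal driver that wires the hypothetical TraderMapper and TraderReducer into a Hadoop job could look like this sketch (input/output paths are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TraderProfitJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "trader profit");
        job.setJarByClass(TraderProfitJob.class);
        job.setMapperClass(TraderMapper.class);
        job.setReducerClass(TraderReducer.class);
        job.setMapOutputKeyClass(Text.class);    // mapper emits Text/Text
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);       // reducer emits Text/DoubleWritable
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/trades.csv"));
        FileOutputFormat.setOutputPath(job, new Path("/data/trader-profit"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}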