Tuesday, 27 January 2015

ETL in Hadoop (Big data)

Preparing data for end users and analysis has always been a challenge, especially on a big data platform.
The life cycle to get data into its pure form goes through multiple stages: extract, transform, load, data cleansing, partitioning, bucketing, and aggregation.

This post explains an efficient and meaningful way of transforming data from its raw to its pure form.

The ETL methodology is well accepted across the industry for data preparation, but in the big data space we tend to follow ELT instead, i.e. Extract, Load, and then Transform. In big data, we first land all the data on HDFS, and then use tools like MapReduce, Cascading, Scalding, Pig, Hive, etc. to run transformations on that data.
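The ELT ordering above can be sketched as a tiny pipeline. This is only an illustration of the order of the three steps, with hypothetical in-memory stand-ins for what would really be a source extract, an HDFS load, and a Pig/Hive transformation:

```python
# Hedged sketch of the ELT order: extract, load as-is, transform last.
# All three functions are hypothetical stand-ins, not a real toolchain.

def extract(source):
    # Pull raw records from the source system (illustrative).
    return [r.strip() for r in source]

def load(records):
    # Land the raw records untouched; on a real cluster this would be
    # an HDFS write into the incoming layer.
    return {"incoming": list(records)}

def transform(layers):
    # Transformation runs last, reading from the already-loaded layer.
    layers["pure"] = [r.upper() for r in layers["incoming"]]
    return layers

layers = transform(load(extract([" a ", " b "])))
print(layers["pure"])  # → ['A', 'B']
```

The point of the ordering is that the raw data is safely on the cluster before any transformation logic runs against it.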

HDFS replicates data for reliability and performance, and processes should take this into account while creating multiple data layers.

Data layers
1. Staging data layer
2. Incoming data layer
3. Play data layer
4. Pure data layer
5. Snapshot data layer

Staging data layer
This is considered transit data and needs to be rotated on a regular basis.
This layer holds data in its raw format: it is loaded into this space just after extraction. Data here is used as a backup for any reload/re-run and for validating audit trails.

This layer is usually not on HDFS.
Data is recommended to be rotated every 15 days, or as per the use case.
Since it is not on HDFS, this layer has limited space allocated to it.
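The 15-day rotation policy can be sketched as a small check that picks out the staging directories due for rotation. The dated directory names and the inclusive 15-day window (2015-01-01 rotating on 2015-01-15, as in the example below) are assumptions for illustration:

```python
# Hedged sketch of the staging rotation policy: a directory dated
# "YYYY-MM-DD" is due for rotation once it has been in staging for 15
# calendar days, counted inclusively. Names/paths are illustrative.
from datetime import date, timedelta

RETENTION_DAYS = 15

def due_for_rotation(staging_dirs, today):
    # A directory loaded on day 1 rotates on day 15, so the cutoff is
    # today minus (RETENTION_DAYS - 1) days.
    cutoff = today - timedelta(days=RETENTION_DAYS - 1)
    return [d for d in staging_dirs if date.fromisoformat(d) <= cutoff]

dirs = ["2015-01-01", "2015-01-10", "2015-01-14"]
print(due_for_rotation(dirs, date(2015, 1, 15)))  # → ['2015-01-01']
```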

Incoming data layer
This is an exact replica of the data we had in staging, persisted for a longer duration in compressed form.
Data is compressed at the same time as the staging rotation, i.e. if the 2015-01-01 data is moved out of staging on 2015-01-15, then on that same day we compress the data into this layer.

Data here is saved on HDFS with a replication factor of 3.
This layer can hold data for up to 2-3 years, or as per the use case.
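The rotate-and-compress step can be sketched as follows. The local-filesystem paths and gzip codec are assumptions to keep the example self-contained; on a real cluster the target would be an HDFS directory, and a splittable codec would usually be preferred:

```python
# Hedged sketch of compressing a rotated staging file into the incoming
# layer. Paths are illustrative; a real pipeline would write to HDFS.
import gzip
import os
import tempfile

def compress_into_incoming(src_path, incoming_dir):
    # Copy the raw staging file into the incoming layer, gzip-compressed.
    os.makedirs(incoming_dir, exist_ok=True)
    dst = os.path.join(incoming_dir, os.path.basename(src_path) + ".gz")
    with open(src_path, "rb") as fin, gzip.open(dst, "wb") as fout:
        fout.write(fin.read())
    return dst

tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "2015-01-01.log")
with open(src, "w") as f:
    f.write("raw event data\n")
dst = compress_into_incoming(src, os.path.join(tmp, "incoming"))
with gzip.open(dst, "rt") as f:
    print(f.read())  # → raw event data
```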

Play data layer
This space is used as a temporary location for any intermediate data produced while transforming data from its raw to its pure state. This space is cleaned regularly after each job run.
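A post-run cleanup of the play layer can be sketched like this. The directory layout is hypothetical, and on a real cluster the deletion would be the equivalent HDFS operation rather than a local one:

```python
# Hedged sketch of cleaning the play layer after a job run.
# The play path is illustrative; a real job would delete on HDFS.
import os
import shutil
import tempfile

def clean_play_area(play_dir):
    # Drop all intermediate data, leaving an empty play area for the
    # next job run.
    if os.path.isdir(play_dir):
        shutil.rmtree(play_dir)
    os.makedirs(play_dir)

tmp = tempfile.mkdtemp()
play = os.path.join(tmp, "play")
os.makedirs(play)
with open(os.path.join(play, "intermediate.part"), "w") as f:
    f.write("scratch output")
clean_play_area(play)
print(os.listdir(play))  # → []
```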

Pure data layer
This space holds data in its pure form, i.e. the data here is the final output of all the planned transformations, and it is available to all downstream processes.
It resides on HDFS with a replication factor of 3 or more, and the data here may or may not be compressed.

Snapshot data layer
This space is built on top of the pure dataset. It can hold multiple presentations of the data in the pure layer, each presentation tailored to a particular use case or downstream application.

For example:
The dataset in the pure layer is aggregated by date and customer ID.

Downstream application (A) works faster when the data is aggregated by date, and downstream application (B) works faster when it is aggregated by customer ID. So we save the same pure data in two different partition layouts for the two different downstream applications.
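The two presentations can be sketched as partitioning the same pure records two different ways. The record fields (`date`, `customer_id`, `amount`) and the `key=value` partition naming are assumptions for illustration, loosely following the Hive-style partition-directory convention:

```python
# Hedged sketch of two snapshot presentations of the same pure data:
# one partitioned by date (for application A), one by customer ID
# (for application B). Field names are illustrative assumptions.
from collections import defaultdict

pure = [
    {"date": "2015-01-01", "customer_id": "C1", "amount": 10},
    {"date": "2015-01-01", "customer_id": "C2", "amount": 20},
    {"date": "2015-01-02", "customer_id": "C1", "amount": 5},
]

def partition_by(records, key):
    # Group records under Hive-style "key=value" partition names.
    parts = defaultdict(list)
    for r in records:
        parts["%s=%s" % (key, r[key])].append(r)
    return dict(parts)

by_date = partition_by(pure, "date")              # for application (A)
by_customer = partition_by(pure, "customer_id")   # for application (B)
print(sorted(by_date))      # → ['date=2015-01-01', 'date=2015-01-02']
print(sorted(by_customer))  # → ['customer_id=C1', 'customer_id=C2']
```

The same records exist in both layouts; only the grouping differs, so each application reads the partitioning that matches its access pattern.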

In the big data space, it is very important to have huge datasets logically separated into different partitions, both for efficient space management and for reliable functioning of the processes that read them.

Each data layer represents the data at one stage of transformation and helps segregate pure data from work data. The layered approach also enables an efficient audit trail mechanism and makes it easier to reconcile data in case of an outage.

