Hadoop Developer Tutorial
Software required
1. VMware Player 4.0.4
2. Linux version: Ubuntu 12.04
3. PuTTY
4. WinSCP
5. User name: *******
6. Password: *******
This tutorial is written against a VM that can run on any Windows machine using VMware Player. All required software is already available on the VM.
The VM can be downloaded from:
Alternatively, the tutorial can be followed if you have access to a Unix/Linux OS.
Lab 1. Preparation for the lab
(Not required if you are not working on the VM provided)
1. Unzip the mirror image at any location on the Windows machine
2. Open VMware Player, go to File -> Open a Virtual Machine, and select the virtual machine file under the mirror image folder:
\ubuntu-server-12.04-amd64\ubuntu-server-12.04-amd64
3. Press Ctrl+G and make a note of the IP address; the same IP can be used to log in via PuTTY and WinSCP
4. Open PuTTY -> log in via the IP address ->
username: ******, password: ********
5. Now the VM window can be minimized and PuTTY can be used from here on
Lab 2. Setting up Hadoop
1. Untar the Hadoop tar file
a. Go to lab/software
b. Untar the Hadoop tar file into the software folder:
c. tar -xvf ../../downloads/hadoop-1.0.3.tar.gz
2. Set up environment variables
a. Open .bash_profile, i.e. vi .bash_profile
b. Enter the following:
1. export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
2. export HADOOP_INSTALL=/home/notroot/lab/software/hadoop-1.0.3
3. export HADOOP_HOME=/home/notroot/lab/software/hadoop-1.0.3
4. export PATH=$PATH:$HADOOP_INSTALL/bin
Save and exit, i.e. type :wq and press Enter
c. Check the installations:
java -version
hadoop version
3. Configuring Hadoop/HDFS/MapReduce
cd $HADOOP_HOME/conf
Reference link:
http://hadoop.apache.org/docs/stable/cluster_setup.html
Modify core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:8020</value>
<final>true</final>
</property>
</configuration>
Modify hdfs-site.xml
<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/home/notroot/lab/hdfs/namenodep,/home/notroot/lab/hdfs/namenodes</value>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/notroot/lab/hdfs/datan1,/home/notroot/lab/hdfs/datan2</value>
<final>true</final>
</property>
<property>
<name>fs.checkpoint.dir</name>
<value>/home/notroot/lab/hdfs/checkp</value>
<final>true</final>
</property>
</configuration>
Modify mapred-site.xml
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
<final>true</final>
</property>
<property>
<name>mapred.local.dir</name>
<value>/home/notroot/lab/mapred/local1,/home/notroot/lab/mapred/local2</value>
<final>true</final>
</property>
<property>
<name>mapred.system.dir</name>
<value>/home/notroot/lab/mapred/system</value>
<final>true</final>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>3</value>
<final>true</final>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>3</value>
<final>true</final>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx400m</value>
<!-- Not marked as final so jobs
can include JVM debugging options -->
</property>
</configuration>
Create directories under lab/hdfs
1. mkdir namenodep
2. mkdir namenodes
3. mkdir datan1
4. mkdir datan2
5. mkdir checkp
Change permissions on the folders
1. chmod 755 datan1
2. chmod 755 datan2
Create directories under lab/mapred
1. mkdir local1
2. mkdir local2
3. mkdir system
Format the namenode (only once)
Cmd: hadoop namenode -format
Starting HDFS and MapReduce services
1) cd $HADOOP_HOME/conf
2) Edit hadoop-env.sh and set JAVA_HOME
a. export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
3) start HDFS services
a. cd $HADOOP_HOME/bin
b. exec: ./start-dfs.sh
4) start MapReduce services
a. cd $HADOOP_HOME/bin
b. exec: ./start-mapred.sh
Run jps and check the processes running:
HDFS services: DataNode, NameNode and SecondaryNameNode
MapReduce services: TaskTracker and JobTracker
Lab 3: HDFS Lab
1) Create input and output directories under HDFS for input and output files
- hadoop fs -mkdir input
- hadoop fs -mkdir output
2) Check the directories
- hadoop fs -ls
3) Copy files from the local system to HDFS and check that they were copied
- hadoop fs -copyFromLocal /home/notroot/lab/data/txns input/
- Check the files: hadoop fs -ls input/
4) Copy from HDFS to the local system
- hadoop fs -copyToLocal input/txns /home/notroot/lab/data/txndatatemp
Go to datan1 and datan2 (under lab/hdfs, look in the current/ subdirectory) and check how the file is split into multiple blocks (blk_* files)
Lab 4: MapReduce - Word Count
1. First we will focus on writing the Java program using Eclipse
2. Eclipse lab (most of you know this already)
a. Untar the Hadoop tar file locally (say, under c:\softwares\)
b. Create a new Java project (MRLab) with package lab.samples
c. Add the Hadoop jar files to the project:
i. Jars under c:\softwares\hadoop-1.0.3
ii. Jars under c:\softwares\hadoop-1.0.3\lib
d. Time to write the Map and Reduce functions
e. We will write three classes and package them together in a jar file (a sketch of these classes is shown after this list)
i. Map class
ii. Reduce class
iii. Driver class (Hadoop will call the main function of this class)
link for sample code:
f. Compile the code and create the jar file
i. Right-click on the project folder -> Export -> JAR file
g. Transfer the jar file from the local machine to the virtual machine; use the WinSCP tool for this
h. Copy the jar file to /home/notroot/lab/programs (on the virtual machine)
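For reference, below is a minimal sketch of what these three classes might look like, written against the org.apache.hadoop.mapreduce API that ships with Hadoop 1.0.3. Only the package name lab.samples and the driver class name WordCount are taken from this lab (they match the jar command used later); the tokenization logic, the Map/Reduce class names, and the choice to nest all three classes in one source file for brevity are illustrative, not the exact sample code.

package lab.samples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map class: emits (word, 1) for every word in each input line
    public static class Map extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce class: sums the counts emitted for each word
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver class: Hadoop calls main() when the jar is submitted
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setCombinerClass(Reduce.class);   // combiner is optional here
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input path from the command line
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path from the command line
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

If you keep the classes nested like this, the whole program is still launched through the single driver class lab.samples.WordCount; if you instead write three separate top-level classes, the driver is the one whose main() configures and submits the Job.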
At this point, we have the MapReduce program (jar file) on the virtual machine, and all processes (HDFS, JobTracker, TaskTracker, ...) are also running on the virtual machine.
Run MapReduce as:
hadoop jar <jar file name>.jar DriverClass <input file path> <output file path>
hadoop jar <jar file name>.jar lab.samples.WordCount input/words output/wcount
The output file can be checked with: hadoop fs -cat output/wcount/part-r-00000
Lab 6: Hive Configuration
Install MySQL on the virtual machine
1. sudo apt-get install mysql-server
2. sudo apt-get install mysql-client-core-5.5
a. Untar the Hive tar file
- Go to lab/software
- Untar the Hive files into the software folder:
tar -xvf ../../downloads/hive-0.9.0-bin.tar.gz
- Browse through the directories and check which subdirectory contains which files
b. Set up .bash_profile
- Open the .bash_profile file under the home directory
- Enter the following settings:
export HIVE_INSTALL=/home/notroot/lab/software/hive-0.9.0-bin
export PATH=$PATH:$HIVE_INSTALL/bin
- Save and exit .bash_profile
- Run the following command:
. .bash_profile
- Verify whether the variables are defined by typing export at the command prompt
c. Check Hive tables
- Run hive and verify that it enters the Hive shell:
hive
- Check databases and tables:
show databases;
show tables;
Lab 7: Hive Programming
Create a database
create database retail;
Use the database
use retail;
Create a table for storing transactional records
Create table txnrecords(txnno INT, txndate STRING, custno INT,
amount DOUBLE, category STRING, product STRING, city STRING,
state STRING, spendBy STRING)
row format delimited
fields terminated by ','
stored as textfile;
Load the data into the table
LOAD DATA LOCAL INPATH '/home/notroot/lab/data/txns'
OVERWRITE INTO TABLE txnrecords;
Describe the metadata or schema of the table
describe txnrecords;
Counting the number of records
Select count(*) from txnrecords;
Counting total spending by category of products
Select category, sum(amount) from txnrecords group by category;
Top 10 customers by spending
Select custno, sum(amount) as total from txnrecords
group by custno order by total desc limit 10;