Hadoop
Goal: experiment with using Hadoop to analyze large quantities of data for scientific and text-mining applications. I will start by installing and configuring single-node Hadoop, then load the data needed, using Pig and Hive to transform and load data onto HDFS.
Single Node
Pre-requisite software tools
- OS: Ubuntu 14.04.1 on a 64-bit machine with 16GB RAM
- JVM 1.7 (bundled with Ubuntu 14.04.1)
- Hadoop 2.6.0 download
- WinSCP to copy files from Windows to the Ubuntu environment
- PuTTY if on Windows; or TotalTerminal on Mac OS
Prep the Environment
- Assume that when Ubuntu was installed, the hostname was set to ubuntuboninc and a user boninc was created. From Windows, use PuTTY to ssh into ubuntuboninc; from Mac OS, use the TotalTerminal application:
- Add a Hadoop user named hdUser
- Add a password for hdUser
- sudo passwd hdUser (enter the password when the system asks)
- Change hdUser shell to use bash shell
- sudo chsh -s /bin/bash hdUser
- Allow hdUser to use sudo
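The account steps above can be sketched as a small script. This is a minimal sketch, not necessarily the commands originally used: the page does not say how hdUser was created or how sudo rights were granted, so `useradd -m` and membership in Ubuntu's `sudo` group are assumptions. The `run` helper, the `setup_hadoop_user` function, and the `DRY_RUN` switch are hypothetical names, added so the commands can be previewed before anything is changed.

```shell
#!/usr/bin/env bash
# Sketch of the hdUser setup. With DRY_RUN=1 (the default) each command is
# only printed; set DRY_RUN=0 to actually run it.
DRY_RUN="${DRY_RUN:-1}"
HADOOP_USER="hdUser"

# run: hypothetical helper that prints a command, or executes it when DRY_RUN=0.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

setup_hadoop_user() {
  run sudo useradd -m "$HADOOP_USER"          # assumption: create the user with a home directory
  run sudo passwd "$HADOOP_USER"              # prompts for the new password
  run sudo chsh -s /bin/bash "$HADOOP_USER"   # switch the login shell to bash
  run sudo usermod -aG sudo "$HADOOP_USER"    # assumption: sudo rights via Ubuntu's sudo group
}

setup_hadoop_user
```

Running it once in dry-run mode and reading the printed commands is a cheap way to catch a mistyped user name before the account is created.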
Hadoop Installation (Installation directory: /usr/local/hadoop-2.6.0)
- On the ubuntuboninc machine, in /home/hdUser, create an install directory
- Use WinSCP to copy hadoop 2.6.0 tar file to ubuntuboninc machine, directory /home/hdUser/install
- Uncompress the hadoop tar into /usr/local
- cd /usr/local
- sudo tar -xzf /home/hdUser/install/hadoop-2.6.0.tar.gz   # creates /usr/local/hadoop-2.6.0
- Create a group named hadoop
- Change ownership of the hadoop files to user hdUser and group hadoop
- sudo chown -R hdUser:hadoop /usr/local/hadoop-2.6.0
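The unpacking step can also be written with tar's `-C` option, which names the target directory explicitly instead of relying on the current directory. The following is a minimal sketch under that assumption; `extract_hadoop` is a hypothetical helper name, and `groupadd` is assumed as the way to create the hadoop group, since the page gives no command for it.

```shell
#!/usr/bin/env bash
# extract_hadoop TARBALL DEST_PARENT
# Unpacks the archive under DEST_PARENT; tar's -C option sets the target directory.
extract_hadoop() {
  tarball="$1"
  dest_parent="$2"
  [ -f "$tarball" ] || { echo "missing archive: $tarball" >&2; return 1; }
  [ -d "$dest_parent" ] || { echo "missing directory: $dest_parent" >&2; return 1; }
  tar -xzf "$tarball" -C "$dest_parent"
}

# Usage with the paths from this page (run as a user allowed to write to /usr/local):
#   extract_hadoop /home/hdUser/install/hadoop-2.6.0.tar.gz /usr/local
#   sudo groupadd hadoop                                   # assumption: how the group is created
#   sudo chown -R hdUser:hadoop /usr/local/hadoop-2.6.0
```

Checking for the archive and the destination first gives a clearer message than tar's own error when a path is mistyped.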
Prepare Hadoop Environment Variables
- Use an editor to edit the file:
- hadoop-env.sh
- in directory: /usr/local/hadoop-2.6.0/etc/hadoop
- Ensure the following two environment variables are set:
- export JAVA_HOME=${JAVA_HOME} # JAVA_HOME itself is set in /home/hdUser/.bashrc
- export HADOOP_PREFIX=/usr/local/hadoop-2.6.0
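hadoop-env.sh relies on JAVA_HOME already being set in /home/hdUser/.bashrc, but the page does not show that line. One possible way to set it, sketched here as an assumption rather than the setup actually used, is to derive it from the real location of the java binary. `java_home_for` is a hypothetical helper name, and `readlink -f` is the GNU coreutils form available on Ubuntu.

```shell
#!/usr/bin/env bash
# java_home_for JAVA_BINARY
# Prints the directory two levels above the binary: <home>/bin/java -> <home>.
java_home_for() {
  dirname "$(dirname "$1")"
}

# Lines one might add to /home/hdUser/.bashrc (assumes java is on PATH;
# readlink -f follows the /usr/bin/java symlink chain to the real binary):
if command -v java >/dev/null 2>&1; then
  JAVA_HOME="$(java_home_for "$(readlink -f "$(command -v java)")")"
  export JAVA_HOME
fi
```

After opening a new shell as hdUser, `echo "$JAVA_HOME"` should print a directory that contains bin/java.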
Test:
- chmod +x /usr/local/hadoop-2.6.0/etc/hadoop/hadoop-env.sh
- /usr/local/hadoop-2.6.0/etc/hadoop/hadoop-env.sh ## execute the environment script to check that it runs without errors
- Try the following command
If the environment variables above are set up correctly, the command will display the usage documentation for the hadoop script.
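The command to try is not shown on this page. In Hadoop 2.x, running bin/hadoop with no arguments prints its usage text, so that is presumably the command meant; treat this as an assumption. The sketch below checks that the install tree looks right before running it. `check_hadoop_install` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# check_hadoop_install PREFIX
# Returns 0 if PREFIX looks like an unpacked Hadoop 2.x tree, 1 otherwise.
check_hadoop_install() {
  prefix="$1"
  if [ ! -f "$prefix/etc/hadoop/hadoop-env.sh" ]; then
    echo "hadoop-env.sh not found under $prefix/etc/hadoop" >&2
    return 1
  fi
  if [ ! -x "$prefix/bin/hadoop" ]; then
    echo "bin/hadoop is missing or not executable under $prefix" >&2
    return 1
  fi
  echo "ok: $prefix"
}

# Usage with the install directory from this page:
#   check_hadoop_install /usr/local/hadoop-2.6.0 && /usr/local/hadoop-2.6.0/bin/hadoop
```

If the check passes but the hadoop script complains about Java, JAVA_HOME is the first thing to re-check.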
HDFS Configuration
<TBA>
Multi Nodes
<To be Added>