
Hadoop Distributed File System
5.1 Introduction
The Hadoop Distributed File System (HDFS) is designed for large-scale distributed data processing under the MapReduce framework. It is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets, on the order of terabytes, petabytes or even more.
HDFS - Features
Highly fault tolerant
High throughput
Suitable for applications with large data sets
Streaming access to file system data
Can be built out of commodity hardware
5.2 Architecture

5.2.1 HDFS Block
The minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB, but it can be increased as needed by changing the HDFS configuration.
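For example, to raise the block size to 128 MB, a property like the following could be added to conf/hdfs-site.xml (a sketch for Hadoop 1.x, where the property is named dfs.block.size; the value is given in bytes and applies to files written after the change):
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
  <description>Block size of 128 MB, in bytes</description>
</property>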
5.2.2 HDFS: Name Nodes and Data Nodes
HDFS follows a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients, and a number of DataNodes, usually one per node in the cluster. The DataNodes manage the storage attached to the nodes they run on. HDFS exposes a file system namespace and allows user data to be stored in files. A file is split into one or more blocks, and the set of blocks is stored across DataNodes. DataNodes serve read and write requests, and perform block creation, deletion, and replication upon instruction from the NameNode.
HDFS - NameNode
The NameNode keeps an image of the entire file system namespace and the file-to-block map (Blockmap) in memory. 4 GB of local RAM is sufficient to support the data structures that represent a huge number of files and directories. When the NameNode starts up, it reads the FsImage and EditLog from its local file system, applies the transactions from the EditLog to the in-memory FsImage, and then stores a copy of the updated FsImage on the file system as a checkpoint. Periodic checkpointing is done so that the system can recover back to the last checkpointed state in the case of a crash. Without the NameNode, the file system cannot be used.
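The local directory where the NameNode keeps the FsImage and EditLog can be configured in conf/hdfs-site.xml; a minimal sketch for Hadoop 1.x, where the property is named dfs.name.dir (the path shown is an assumed example):
<property>
  <name>dfs.name.dir</name>
  <value>/home/hadoop/hadoopdata/namenode</value>
  <description>Local directory where the NameNode stores the FsImage and EditLog</description>
</property>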
HDFS - DataNode
A DataNode stores HDFS data in files in its local file system. It has no knowledge of the HDFS file structure and stores each block of HDFS data in a separate local file. It does not create all files in the same directory; instead, it uses heuristics to determine the optimal number of files per directory and creates subdirectories appropriately. When the DataNode starts up, it generates a list of all the HDFS blocks it holds and sends this block report to the NameNode.
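To see how a file is actually split into blocks and which DataNodes hold the replicas, the fsck utility can be used; a sketch, assuming a file /user/expert/myinput/file1.txt like the one used in the examples of section 5.3:
./bin/hadoop fsck /user/expert/myinput/file1.txt -files -blocks -locations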
5.3 HDFS Command line Interface
Hadoop provides a set of command line utilities that work similarly to the Linux file commands. These Hadoop file shell commands are your primary interface with the HDFS system.
Hadoop file commands take the form of hadoop fs -cmd <args>
where cmd is the specific file command and <args> is a variable number of arguments. The command cmd is usually named after the corresponding UNIX equivalent. For example, the command for listing files is:
./bin/hadoop fs -ls
The most common file management tasks in Hadoop are (a) Adding files/directories, (b) Retrieving files/directories and (c) Deleting files/directories
a) Adding files
1) Create a directory in HDFS
./bin/hadoop fs -mkdir hdfs://localhost:9000/user/expert/myinput
./bin/hadoop fs -mkdir /user/expert/myinput
./bin/hadoop fs -mkdir myinput
All three commands create the same directory myinput in HDFS. Use any one of them.
2) Add/Copy a local file to HDFS
./bin/hadoop fs -copyFromLocal file1.txt hdfs://localhost:9000/user/expert/myinput
./bin/hadoop fs -copyFromLocal file1.txt /user/expert/myinput
./bin/hadoop fs -copyFromLocal file1.txt myinput
Here file1.txt is present in the Hadoop home directory from which we run the command.
All of the above commands copy the file1.txt file to the myinput directory in HDFS. Any one of them can be used, and the equivalent command "-put" can also be used in place of "-copyFromLocal", e.g.
./bin/hadoop fs -put file1.txt hdfs://localhost:9000/user/expert/myinput
./bin/hadoop fs -put file1.txt /user/expert/myinput
./bin/hadoop fs -put file1.txt myinput
b) Retrieving files
Copy from HDFS to Local File system
./bin/hadoop fs -copyToLocal hdfs://localhost:9000/user/expert/myinput .
./bin/hadoop fs -copyToLocal /user/expert/myinput .
./bin/hadoop fs -copyToLocal myinput .
All of the above commands copy the myinput directory from HDFS to the current directory (".") on the local file system. Any one of them can be used, and the equivalent command "-get" can also be used in place of "-copyToLocal", e.g.
./bin/hadoop fs -get hdfs://localhost:9000/user/expert/myinput .
./bin/hadoop fs -get /user/expert/myinput .
./bin/hadoop fs -get myinput .
c) Deleting files
1) Delete directory from HDFS
./bin/hadoop fs -rmr hdfs://localhost:9000/user/expert/myinput
./bin/hadoop fs -rmr /user/expert/myinput
./bin/hadoop fs -rmr myinput
All of the above commands delete the myinput directory from HDFS. Use any one of them.
2) Delete file from HDFS
./bin/hadoop fs -rm hdfs://localhost:9000/user/expert/myinput/file1.txt
./bin/hadoop fs -rm /user/expert/myinput/file1.txt
./bin/hadoop fs -rm myinput/file1.txt
All of the above commands delete file1.txt in the myinput directory from HDFS. Use any one of them.
For a more detailed description, try the command below:
./bin/hadoop fs -help
5.4 HDFS Command Reference
There are many more commands in "./bin/hadoop dfs" than were demonstrated here, although these basic operations will get you started. Running ./bin/hadoop dfs with no additional arguments will list all commands which can be run with the FsShell system. Furthermore, ./bin/hadoop dfs -help commandName will display a short usage summary for the operation in question, if you are stuck.
A table of all operations is given below. The following conventions are used for parameters:
"<path>" means any file or directory name. "<path>..." means one or more file or directory names. "<file>" means any filename. "<src>" and "<dest>" are path names in a directed operation. "<localSrc>" and "<localDest>" are paths as above, but on the local file system.
All other file and path names refer to objects inside HDFS.
Command | Description |
---|---|
-ls <path> | Lists the contents of the directory specified by path, showing the names, permissions, owner, size and modification date for each entry. |
-lsr <path> | Behaves like -ls, but recursively displays entries in all subdirectories of path. |
-du <path> | Shows disk usage, in bytes, for all files which match path; filenames are reported with the full HDFS protocol prefix. |
-dus <path> | Like -du, but prints a summary of disk usage of all files/directories in the path. |
-mv <src> <dest> | Moves the file or directory indicated by src to dest, within HDFS. |
-cp <src> <dest> | Copies the file or directory identified by src to dest, within HDFS. |
-rm <path> | Removes the file or empty directory identified by path. |
-rmr <path> | Removes the file or directory identified by path. Recursively deletes any child entries (i.e., files or subdirectories of path). |
-put <localSrc> <dest> | Copies the file or directory from the local file system identified by localSrc to dest within the DFS. |
-copyFromLocal <localSrc> <dest> | Identical to -put |
-moveFromLocal <localSrc> <dest> | Copies the file or directory from the local file system identified by localSrc to dest within HDFS, and then deletes the local copy on success. |
-get [-crc] <src> <localDest> | Copies the file or directory in HDFS identified by src to the local file system path identified by localDest. |
-getmerge <src> <localDest> | Retrieves all files that match the path src in HDFS, and copies them to a single, merged file in the local file system identified by localDest. |
-cat <filename> | Displays the contents of filename on stdout. |
-copyToLocal <src> <localDest> | Identical to -get |
-moveToLocal <src> <localDest> | Works like -get, but deletes the HDFS copy on success. |
-mkdir <path> | Creates a directory named path in HDFS. Creates any parent directories in path that are missing (e.g., like mkdir -p in Linux). |
-setrep [-R] [-w] rep <path> | Sets the target replication factor for files identified by path to rep. (The actual replication factor will move toward the target over time) |
-touchz <path> | Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path, unless the file is already size 0. |
-test -[ezd] <path> | Returns 1 if path exists, has zero length, or is a directory; returns 0 otherwise. |
-stat [format] <path> | Prints information about path. Format is a string which accepts file size in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y). |
-tail [-f] <filename> | Shows the last 1KB of the file on stdout. |
-chmod [-R] mode,mode,... <path>... | Changes the file permissions associated with one or more objects identified by path.... Performs changes recursively with -R. mode is a 3-digit octal mode, or {augo}+/-{rwxX}. Assumes a if no scope is specified and does not apply a umask. |
-chown [-R] [owner][:[group]] <path>... | Sets the owning user and/or group for files or directories identified by path.... Sets owner recursively if -R is specified. |
-chgrp [-R] group <path>... | Sets the owning group for files or directories identified by path.... Sets group recursively if -R is specified. |
-help <cmd-name> | Returns usage information for one of the commands listed above. You must omit the leading '-' character in cmd |
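As a short illustration of a few commands from the table that were not demonstrated in section 5.3 (the myinput directory and file1.txt from the earlier examples are assumed to exist in HDFS):
./bin/hadoop fs -du myinput                        # disk usage, in bytes, of each file in myinput
./bin/hadoop fs -cat myinput/file1.txt             # print the file contents on stdout
./bin/hadoop fs -tail myinput/file1.txt            # show the last 1KB of the file
./bin/hadoop fs -setrep -w 2 myinput/file1.txt     # set the target replication factor to 2 and wait
./bin/hadoop fs -chmod -R 700 myinput              # restrict permissions recursively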
5.5 Adding a New DataNode in the Hadoop Cluster
We describe the steps required to add a new node to a Hadoop cluster.
1) Networking
Add the new node to the existing Hadoop cluster with an appropriate network configuration. To keep it simple, we assume the following network configuration.
For the new node configuration:
IP address : 192.168.1.103
netmask : 255.255.255.0
hostname : slave3.in
gateway : leave it blank
DNS : leave it blank
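On a RHEL/CentOS-style machine (consistent with the /etc/sysconfig/network file used in step 3 below), this configuration could be applied by editing /etc/sysconfig/network-scripts/ifcfg-eth0 on the new node; a sketch, assuming the interface is named eth0:
DEVICE=eth0
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.1.103
NETMASK=255.255.255.0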
2) Add a User and SSH Access
2.1) Add a User
On the new node, add a "hadoop" user and set the password of the hadoop user to "hadoop123" (or anything you want) by using the commands below.
adduser hadoop
passwd hadoop
2.2) Set up passwordless SSH connectivity from the master to the new slave
Execute the following on the master
mkdir -p $HOME/.ssh
chmod 700 $HOME/.ssh
ssh-keygen -t rsa -P '' -f $HOME/.ssh/id_rsa
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
chmod 644 $HOME/.ssh/authorized_keys
Copy the public key to the new slave node, into the hadoop user's $HOME directory:
scp $HOME/.ssh/id_rsa.pub hadoop@192.168.1.103:/home/hadoop/
Execute the following on the slave
Log in as the hadoop user, if not already logged in:
su - hadoop or ssh -X hadoop@192.168.1.103
Copy the content of the public key into the file "$HOME/.ssh/authorized_keys" and then change its permissions by executing the following commands.
cd $HOME
mkdir -p $HOME/.ssh
chmod 700 $HOME/.ssh
cat id_rsa.pub >> $HOME/.ssh/authorized_keys
chmod 644 $HOME/.ssh/authorized_keys
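Alternatively, if the ssh-copy-id utility is available on the master, the manual copy and append steps above can be replaced by a single command run on the master (a sketch; it prompts once for the hadoop user's password on the new node):
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@192.168.1.103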
2.3) Check ssh login from master machine
Now check that you can ssh to the new node without a password from the master
ssh hadoop@192.168.1.103 or ssh hadoop@slave3
3) Set Hostname of New Node
You can set the hostname in the file /etc/sysconfig/network.
On the new slave3 machine:
NETWORKING=yes
HOSTNAME=slave3.in
To make the changes effective, either restart the machine or run the hostname command on the new machine with the respective hostname (a restart is a good option).
On the slave3 node machine:
hostname slave3.in
Update /etc/hosts on all machines of the cluster with the following lines:
192.168.1.103 slave3.in slave3
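After the update, the relevant entries in /etc/hosts on each machine might look like the following (the master's address 192.168.1.100 is the one used for the master login in section 5.6 and is otherwise an assumption here; entries for any existing slaves would also appear):
192.168.1.100 master.in master
192.168.1.103 slave3.in slave3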
Now try to ping the machine by hostname to check whether it resolves to the IP address.
On the new node machine:
ping master.in
4) JAVA Installation
Ensure that Java 1.6 or a higher version is installed on the target machine by running the "java -version" command. The desired output will look like:
$ java -version
java version "1.6.0_03"
Java(TM) SE Runtime Environment (build 1.6.0_03-b05)
Java HotSpot(TM) Server VM (build 1.6.0_03-b05, mixed mode)
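Once the Hadoop directory is copied over in step 5, it is worth verifying that conf/hadoop-env.sh on the new node points to this Java installation; a sketch of the relevant line (the JDK path is an assumed example):
export JAVA_HOME=/usr/local/jdk1.6.0_03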
5) Copy Hadoop Directory from Master to New Node
5.1) Log in as the hadoop user on the master machine, if not already logged in
su - hadoop or ssh -X hadoop@master
5.2) Add the new node to the file conf/slaves
Add the new node's domain name to the file "$HOME/hadoop-1.2.1/conf/slaves":
slave3.in
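After this change, the conf/slaves file on the master might look like the following (the entries slave1.in and slave2.in are assumed to be the existing slaves of this cluster):
slave1.in
slave2.in
slave3.in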
5.3) Copy the Hadoop directory to the new node using the rsync command
rsync -r $HOME/hadoop-1.2.1 hadoop@192.168.1.103:
6) Start the DataNode on New Node
Start the DataNode daemon manually using the ./bin/hadoop-daemon.sh script. It will automatically contact the master (NameNode) and join the cluster. We have also added the new node to the conf/slaves file on the master server (step 5.2), so the script-based cluster commands will recognize the new node.
6.1) Log in to the new node
su - hadoop or ssh -X hadoop@192.168.1.103
6.2) Start HDFS on the newly added slave node by using the command below
./bin/hadoop-daemon.sh start datanode
Check the output of the jps command on the new node. It should look like:
$ jps
7141 DataNode
10312 Jps
6.3) Optional: if you want a TaskTracker for MapReduce on the newly added slave node, run the command below
./bin/hadoop-daemon.sh start tasktracker
Check the output of the jps command on the new node. It should look like:
$ jps
7331 TaskTracker
7141 DataNode
10312 Jps
5.6 Removing or Decommissioning a DataNode from the Hadoop Cluster
We can remove a node from a cluster on the fly, while the cluster is running, without data loss. HDFS provides a decommissioning feature which ensures that removing a node is performed safely. To use it, follow the steps below:
Step 0: Log in to the master
Log in to the master machine as the user under which Hadoop is installed:
su - hadoop or ssh hadoop@192.168.1.100
Step 1: Change cluster configuration
An excludes file must be configured before starting the cluster. Add a key named dfs.hosts.exclude to the conf/hdfs-site.xml file. The value associated with this key is the full path to a file on the NameNode's local file system which contains a list of machines that are not permitted to connect to HDFS.
e.g. Add these lines to the conf/hdfs-site.xml file:
<property>
  <name>dfs.hosts.exclude</name>
  <value>/home/hadoop/hadoop-1.2.1/hdfs_exclude.txt</value>
  <description>DFS exclude</description>
</property>
Step 2: Determine hosts to decommission
Each machine to be decommissioned should be added to the file identified by hdfs_exclude.txt, one domain name per line. This will prevent them from connecting to the NameNode. For example, if you want to remove DataNode2, the content of the "/home/hadoop/hadoop-1.2.1/hdfs_exclude.txt" file is:
slave2.in
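This file can be created (or appended to) from the shell on the master; a simple sketch:
echo "slave2.in" >> /home/hadoop/hadoop-1.2.1/hdfs_exclude.txt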
Step 3: Force configuration reload
Run the command "./bin/hadoop dfsadmin -refreshNodes" without quotes.
./bin/hadoop dfsadmin -refreshNodes
This will force the NameNode to re-read its configuration, including the newly updated excludes file. It will decommission the nodes over a period of time, allowing time for each node's blocks to be replicated onto machines which are scheduled to remain active.
On slave2.in, check the jps command output. After some time, you will see that the DataNode process is shut down automatically.
Step 4: Shut down nodes.
After the decommission process has completed, the decommissioned hardware can be safely shut down for maintenance. Run the dfsadmin report command to check the status of the decommission.
./bin/hadoop dfsadmin -report
The above command will describe the status of the decommissioned node and of the nodes connected to the cluster.
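To focus on the decommissioning information in that report, the output can be filtered; a sketch, assuming the Hadoop 1.x report format, which prints a Name line and a Decommission Status line for each DataNode:
./bin/hadoop dfsadmin -report | grep -E "Name:|Decommission Status"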
Step 5: Edit excludes file again.
Once the machines have been decommissioned, they can be removed from the excludes file. Running "./bin/hadoop dfsadmin -refreshNodes" again will make the NameNode re-read the excludes file, allowing the DataNodes to rejoin the cluster after maintenance has been completed or when additional capacity is needed in the cluster again.
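For example, to clear the excludes file configured in Step 1 and have the NameNode re-read it (a sketch; the DataNode daemon would still need to be started again on the node if it was stopped):
cat /dev/null > /home/hadoop/hadoop-1.2.1/hdfs_exclude.txt
./bin/hadoop dfsadmin -refreshNodes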
Special Note: If the above process is followed, the TaskTracker process will still be running on the node that needs to go down. One way is to disconnect the machine, as we did in the above steps; the master will recognize this automatically and declare it dead. There is no need to follow the same decommissioning process for the TaskTracker, because it is not as crucial as the DataNode. The DataNode contains the data, which you want to remove safely without data loss.
The TaskTracker can be started or shut down on the fly with the following commands at any point in time.
./bin/hadoop-daemon.sh stop tasktracker
./bin/hadoop-daemon.sh start tasktracker