
Hadoop Distributed File System
5.1 Introduction
The Hadoop Distributed File System (HDFS) is designed for large-scale distributed data processing under the MapReduce framework. It is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets, on the order of terabytes, petabytes or even more.
HDFS - Features
Highly fault tolerant
High throughput
Suitable for applications with large data sets
Streaming access to file system data
Can be built out of commodity hardware
5.2 Architecture

5.2.1 HDFS Block
The minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB, but it can be increased as needed by changing the HDFS configuration.
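For example, to raise the block size to 128 MB, a property like the following could be added to conf/hdfs-site.xml (a sketch for Hadoop 1.x, where the property is named dfs.block.size; the value is given in bytes and applies to files written after the change):
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
  <description>Block size of 128 MB, in bytes</description>
</property>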
5.2.2 HDFS: Name Nodes and Data Nodes
HDFS follows a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients, and a number of DataNodes, usually one per node in the cluster. The DataNodes manage the storage attached to the nodes they run on. HDFS exposes a file system namespace and allows user data to be stored in files. A file is split into one or more blocks, and the set of blocks is stored across DataNodes. DataNodes serve read and write requests, and perform block creation, deletion, and replication upon instruction from the NameNode.
HDFS - NameNode
The NameNode keeps an image of the entire file system namespace and the file-to-block map (Blockmap) in memory. 4 GB of local RAM is sufficient to support the data structures that represent a huge number of files and directories. When the NameNode starts up, it reads the FsImage and EditLog from its local file system, applies the transactions from the EditLog to the in-memory FsImage, and then stores a copy of the updated FsImage on the file system as a checkpoint. Periodic checkpointing is done so that the system can recover back to the last checkpointed state in the case of a crash. Without the NameNode, the file system cannot be used.
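The local directory where the NameNode keeps the FsImage and EditLog can be configured in conf/hdfs-site.xml; a minimal sketch for Hadoop 1.x, where the property is named dfs.name.dir (the path shown is an assumed example):
<property>
  <name>dfs.name.dir</name>
  <value>/home/hadoop/hadoopdata/namenode</value>
  <description>Local directory where the NameNode stores the FsImage and EditLog</description>
</property>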
HDFS - DataNode
A DataNode stores HDFS data in files in its local file system. It has no knowledge of the HDFS file structure and stores each block of HDFS data in a separate local file. It does not create all files in the same directory; instead, it uses heuristics to determine the optimal number of files per directory and creates subdirectories appropriately. When the DataNode starts up, it generates a list of all the HDFS blocks it holds and sends this block report to the NameNode.
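To see how a file is actually split into blocks and which DataNodes hold the replicas, the fsck utility can be used; a sketch, assuming a file /user/expert/myinput/file1.txt like the one used in the examples of section 5.3:
./bin/hadoop fsck /user/expert/myinput/file1.txt -files -blocks -locations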
5.3 HDFS Command line Interface
Hadoop provides a set of command line utilities that work similarly to the Linux file commands. These Hadoop file shell commands are your primary interface with the HDFS system.
Hadoop file commands take the form of hadoop fs -cmd <args>
where cmd is the specific file command and <args> is a variable number of arguments. The command cmd is usually named after the corresponding UNIX equivalent. For example, the command for listing files is:
./bin/hadoop fs -ls
The most common file management tasks in Hadoop are (a) Adding files/directories, (b) Retrieving files/directories and (c) Deleting files/directories
a) Adding files
1) Create a directory in HDFS
./bin/hadoop fs -mkdir hdfs://localhost:9000/user/expert/myinput
./bin/hadoop fs -mkdir /user/expert/myinput
./bin/hadoop fs -mkdir myinput
All three commands create the same directory myinput in HDFS. Use any one of them.
2) Add/Copy a local file to HDFS
./bin/hadoop fs -copyFromLocal file1.txt hdfs://localhost:9000/user/expert/myinput
./bin/hadoop fs -copyFromLocal file1.txt /user/expert/myinput
./bin/hadoop fs -copyFromLocal file1.txt myinput
Here file1.txt is present in the Hadoop home directory from which we run the command.
All of the above commands copy the file1.txt file to the myinput directory in HDFS. Any one of them can be used, and the equivalent command "-put" can also be used in place of "-copyFromLocal", e.g.
./bin/hadoop fs -put file1.txt hdfs://localhost:9000/user/expert/myinput
./bin/hadoop fs -put file1.txt /user/expert/myinput
./bin/hadoop fs -put file1.txt myinput
b) Retrieving files
Copy from HDFS to Local File system
./bin/hadoop fs -copyToLocal hdfs://localhost:9000/user/expert/myinput .
./bin/hadoop fs -copyToLocal /user/expert/myinput .
./bin/hadoop fs -copyToLocal myinput .
All of the above commands copy the myinput directory from HDFS to the current directory (".") on the local file system. Any one of them can be used, and the equivalent command "-get" can also be used in place of "-copyToLocal", e.g.
./bin/hadoop fs -get hdfs://localhost:9000/user/expert/myinput .
./bin/hadoop fs -get /user/expert/myinput .
./bin/hadoop fs -get myinput .
c) Deleting files
1) Delete directory from HDFS
./bin/hadoop fs -rmr hdfs://localhost:9000/user/expert/myinput
./bin/hadoop fs -rmr /user/expert/myinput
./bin/hadoop fs -rmr myinput
All of the above commands delete the myinput directory from HDFS. Use any one of them.
2) Delete file from HDFS
./bin/hadoop fs -rm hdfs://localhost:9000/user/expert/myinput/file1.txt
./bin/hadoop fs -rm /user/expert/myinput/file1.txt
./bin/hadoop fs -rm myinput/file1.txt
All of the above commands delete file1.txt in the myinput directory from HDFS. Use any one of them.
For a more detailed description, try the command below:
./bin/hadoop fs -help
5.4 HDFS Command Reference
There are many more commands in "./bin/hadoop dfs" than were demonstrated here, although these basic operations will get you started. Running ./bin/hadoop dfs with no additional arguments will list all commands which can be run with the FsShell system. Furthermore, ./bin/hadoop dfs -help commandName will display a short usage summary for the operation in question, if you are stuck.
A table of all operations is given below. The following conventions are used for parameters:
"<path>" means any file or directory name. "<path>..." means one or more file or directory names. "<file>" means any filename. "<src>" and "<dest>" are path names in a directed operation. "<localSrc>" and "<localDest>" are paths as above, but on the local file system.
All other file and path names refer to objects inside HDFS.
Command | Description |
---|---|
-ls <path> | Lists the contents of the directory specified by path, showing the names, permissions, owner, size and modification date for each entry. |
-lsr <path> | Behaves like -ls, but recursively displays entries in all subdirectories of path. |
-du <path> | Shows disk usage, in bytes, for all files which match path; filenames are reported with the full HDFS protocol prefix. |
-dus <path> | Like -du, but prints a summary of disk usage of all files/directories in the path. |
-mv <src> <dest> | Moves the file or directory indicated by src to dest, within HDFS. |
-cp <src> <dest> | Copies the file or directory identified by src to dest, within HDFS. |
-rm <path> | Removes the file or empty directory identified by path. |
-rmr <path> | Removes the file or directory identified by path. Recursively deletes any child entries (i.e., files or subdirectories of path). |
-put <localSrc> <dest> | Copies the file or directory from the local file system identified by localSrc to dest within the DFS. |
-copyFromLocal <localSrc> <dest> | Identical to -put |
-moveFromLocal <localSrc> <dest> | Copies the file or directory from the local file system identified by localSrc to dest within HDFS, and then deletes the local copy on success. |
-get [-crc] <src> <localDest> | Copies the file or directory in HDFS identified by src to the local file system path identified by localDest. |
-getmerge <src> <localDest> | Retrieves all files that match the path src in HDFS, and copies them to a single, merged file in the local file system identified by localDest. |
-cat <filename> | Displays the contents of filename on stdout. |
-copyToLocal <src> <localDest> | Identical to -get |
-moveToLocal <src> <localDest> | Works like -get, but deletes the HDFS copy on success. |
-mkdir <path> | Creates a directory named path in HDFS. Creates any parent directories in path that are missing (e.g., like mkdir -p in Linux). |
-setrep [-R] [-w] rep <path> | Sets the target replication factor for files identified by path to rep. (The actual replication factor will move toward the target over time) |
-touchz <path> | Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path, unless the file is already size 0. |
-test -[ezd] <path> | Returns 1 if path exists, has zero length, or is a directory; returns 0 otherwise. |
-stat [format] <path> | Prints information about path. Format is a string which accepts file size in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y). |
-tail [-f] <filename> | Shows the last 1KB of the file on stdout. |
-chmod [-R] mode,mode,... <path>... | Changes the file permissions associated with one or more objects identified by path.... Performs changes recursively with -R. mode is a 3-digit octal mode, or {augo}+/-{rwxX}. Assumes a if no scope is specified and does not apply a umask. |
-chown [-R] [owner][:[group]] <path>... | Sets the owning user and/or group for files or directories identified by path.... Sets owner recursively if -R is specified. |
-chgrp [-R] group <path>... | Sets the owning group for files or directories identified by path.... Sets group recursively if -R is specified. |
-help <cmd-name> | Returns usage information for one of the commands listed above. You must omit the leading '-' character in cmd |
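As a short illustration of a few commands from the table that were not demonstrated in section 5.3 (the myinput directory and file1.txt from the earlier examples are assumed to exist in HDFS):
./bin/hadoop fs -du myinput                        # disk usage, in bytes, of each file in myinput
./bin/hadoop fs -cat myinput/file1.txt             # print the file contents on stdout
./bin/hadoop fs -tail myinput/file1.txt            # show the last 1KB of the file
./bin/hadoop fs -setrep -w 2 myinput/file1.txt     # set the target replication factor to 2 and wait
./bin/hadoop fs -chmod -R 700 myinput              # restrict permissions recursively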
5.5 Adding a New DataNode in the Hadoop Cluster
We describe the steps required to add a new node to a Hadoop cluster.
1) Networking
Add the new node to the existing Hadoop cluster with an appropriate network configuration. To keep it simple, we assume the following network configuration.
For the new node configuration:
IP address : 192.168.1.103
netmask : 255.255.255.0
hostname : slave3.in
gateway : leave it blank
DNS : leave it blank
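On a RHEL/CentOS-style machine (consistent with the /etc/sysconfig/network file used in step 3 below), this configuration could be applied by editing /etc/sysconfig/network-scripts/ifcfg-eth0 on the new node; a sketch, assuming the interface is named eth0:
DEVICE=eth0
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.1.103
NETMASK=255.255.255.0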
2) Add a User and SSH Access
2.1) Add a User
On the new node, add a "hadoop" user and set the password of the hadoop user to "hadoop123" (or anything you want) by using the commands below.
adduser hadoop
passwd hadoop
2.2) Set up passwordless SSH connectivity from the master to the new slave
Execute the following on the master
mkdir -p $HOME/.ssh
chmod 700 $HOME/.ssh
ssh-keygen -t rsa -P '' -f $HOME/.ssh/id_rsa
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
chmod 644 $HOME/.ssh/authorized_keys
Copy the public key to the new slave node, into the hadoop user's $HOME directory:
scp $HOME/.ssh/id_rsa.pub hadoop@192.168.1.103:/home/hadoop/
Execute the following on the slave
Log in as the hadoop user, if not already logged in:
su - hadoop or ssh -X hadoop@192.168.1.103
Copy the content of the public key into the file "$HOME/.ssh/authorized_keys" and then change its permissions by executing the following commands.
cd $HOME
mkdir -p $HOME/.ssh
chmod 700 $HOME/.ssh
cat id_rsa.pub >> $HOME/.ssh/authorized_keys
chmod 644 $HOME/.ssh/authorized_keys
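Alternatively, if the ssh-copy-id utility is available on the master, the manual copy and append steps above can be replaced by a single command run on the master (a sketch; it prompts once for the hadoop user's password on the new node):
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@192.168.1.103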
2.3) Check ssh login from master machine
Now check that you can ssh to the new node without a password from the master
ssh hadoop@192.168.1.103 or ssh hadoop@slave3
3) Set Hostname of New Node
You can set the hostname in the file /etc/sysconfig/network.
On the new slave3 machine:
NETWORKING=yes
HOSTNAME=slave3.in
To make the changes effective, either restart the machine or run the hostname command on the new machine with the respective hostname (a restart is a good option).
On the slave3 node machine:
hostname slave3.in
Update /etc/hosts on all machines of the cluster with the following lines:
192.168.1.103 slave3.in slave3
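After the update, the relevant entries in /etc/hosts on each machine might look like the following (the master's address 192.168.1.100 is the one used for the master login in section 5.6 and is otherwise an assumption here; entries for any existing slaves would also appear):
192.168.1.100 master.in master
192.168.1.103 slave3.in slave3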
Now try to ping the machine by hostname to check whether it resolves to the IP address.
On the new node machine:
ping master.in
4) JAVA Installation
Ensure that Java 1.6 or a higher version is installed on the target machine by running the "java -version" command. The desired output will look like:
$ java -version
java version "1.6.0_03"
Java(TM) SE Runtime Environment (build 1.6.0_03-b05)
Java HotSpot(TM) Server VM (build 1.6.0_03-b05, mixed mode)
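Once the Hadoop directory is copied over in step 5, it is worth verifying that conf/hadoop-env.sh on the new node points to this Java installation; a sketch of the relevant line (the JDK path is an assumed example):
export JAVA_HOME=/usr/local/jdk1.6.0_03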
5) Copy Hadoop Directory from Master to New Node
5.1) Log in as the hadoop user on the master machine, if not already logged in
su - hadoop or ssh -X hadoop@master
5.2) Add the new node to the file conf/slaves
Add the new node's domain name to the file "$HOME/hadoop-1.2.1/conf/slaves":
slave3.in
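After this change, the conf/slaves file on the master might look like the following (the entries slave1.in and slave2.in are assumed to be the existing slaves of this cluster):
slave1.in
slave2.in
slave3.in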
5.3) Copy the Hadoop directory to the new node using the rsync command
rsync -r $HOME/hadoop-1.2.1 hadoop@192.168.1.103:
6) Start the DataNode on New Node
Start the DataNode daemon manually using the ./bin/hadoop-daemon.sh script. It will automatically contact the master (NameNode) and join the cluster. We have also added the new node to the conf/slaves file on the master server (step 5.2), so the script-based cluster commands will recognize the new node.
6.1) Log in to the new node
su - hadoop or ssh -X hadoop@192.168.1.103
6.2) Start HDFS on the newly added slave node by using the command below
./bin/hadoop-daemon.sh start datanode
Check the output of the jps command on the new node. It should look like:
$ jps
7141 DataNode
10312 Jps
6.3) Optional: if you want a TaskTracker for MapReduce on the newly added slave node, run the command below
./bin/hadoop-daemon.sh start tasktracker
Check the output of the jps command on the new node. It should look like:
$ jps
7331 TaskTracker
7141 DataNode
10312 Jps
5.6 Removing or Decommissioning a DataNode from the Hadoop Cluster
We can remove a node from a cluster on the fly, while the cluster is running, without data loss. HDFS provides a decommissioning feature which ensures that removing a node is performed safely. To use it, follow the steps below:
Step 0: Log in to the master
Log in to the master machine as the user under which Hadoop is installed:
su - hadoop or ssh hadoop@192.168.1.100
Step 1: Change cluster configuration
An excludes file must be configured before starting the cluster. Add a key named dfs.hosts.exclude to the conf/hdfs-site.xml file. The value associated with this key is the full path to a file on the NameNode's local file system which contains a list of machines that are not permitted to connect to HDFS.
e.g. Add these lines to the conf/hdfs-site.xml file:
<property>
  <name>dfs.hosts.exclude</name>
  <value>/home/hadoop/hadoop-1.2.1/hdfs_exclude.txt</value>
  <description>DFS exclude</description>
</property>
Step 2: Determine hosts to decommission
Each machine to be decommissioned should be added to the file identified by hdfs_exclude.txt, one domain name per line. This will prevent them from connecting to the NameNode. For example, if you want to remove DataNode2, the content of the "/home/hadoop/hadoop-1.2.1/hdfs_exclude.txt" file is:
slave2.in
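This file can be created (or appended to) from the shell on the master; a simple sketch:
echo "slave2.in" >> /home/hadoop/hadoop-1.2.1/hdfs_exclude.txt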
Step 3: Force configuration reload
Run the command "./bin/hadoop dfsadmin -refreshNodes" without quotes.
./bin/hadoop dfsadmin -refreshNodes
This will force the NameNode to re-read its configuration, including the newly updated excludes file. It will decommission the nodes over a period of time, allowing time for each node's blocks to be replicated onto machines which are scheduled to remain active.
On slave2.in, check the jps command output. After some time, you will see that the DataNode process is shut down automatically.
Step 4: Shut down nodes.
After the decommission process has completed, the decommissioned hardware can be safely shut down for maintenance. Run the dfsadmin report command to check the status of the decommission.
./bin/hadoop dfsadmin -report
The above command will describe the status of the decommissioned node and of the nodes connected to the cluster.
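To focus on the decommissioning information in that report, the output can be filtered; a sketch, assuming the Hadoop 1.x report format, which prints a Name line and a Decommission Status line for each DataNode:
./bin/hadoop dfsadmin -report | grep -E "Name:|Decommission Status"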
Step 5: Edit excludes file again.
Once the machines have been decommissioned, they can be removed from the excludes file. Running "./bin/hadoop dfsadmin -refreshNodes" again will make the NameNode re-read the excludes file, allowing the DataNodes to rejoin the cluster after maintenance has been completed or when additional capacity is needed in the cluster again.
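For example, to clear the excludes file configured in Step 1 and have the NameNode re-read it (a sketch; the DataNode daemon would still need to be started again on the node if it was stopped):
cat /dev/null > /home/hadoop/hadoop-1.2.1/hdfs_exclude.txt
./bin/hadoop dfsadmin -refreshNodes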
Special Note: If the above process is followed, the TaskTracker process will still be running on the node that needs to go down. One way is to disconnect the machine, as we did in the above steps; the master will recognize this automatically and declare it dead. There is no need to follow the same decommissioning process for the TaskTracker, because it is not as crucial as the DataNode. The DataNode contains the data, which you want to remove safely without data loss.
The TaskTracker can be started or shut down on the fly with the following commands at any point in time.
./bin/hadoop-daemon.sh stop tasktracker
./bin/hadoop-daemon.sh start tasktracker