How to Install and Configure Apache Hadoop on a Single Node in CentOS 8?


Apache Hadoop is an open-source framework that allows for distributed processing of large data sets. It can be installed and configured on a single node, which can be useful for development and testing purposes. In this article, we will discuss how to install and configure Apache Hadoop on a single node running CentOS 8.

Step 1: Install Java

Apache Hadoop requires Java to be installed on the system. To install Java 11 (OpenJDK), run the following command −

sudo dnf install java-11-openjdk-devel
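
You can confirm the installation and check which Java version is on the PATH with −

java -version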

Step 2: Install Apache Hadoop

Apache Hadoop can be downloaded from the official Apache website. The latest stable version at the time of writing this article is 3.3.1. You can download it with the following command (if this release has since been removed from the main download site, it is still available from the Apache archive at archive.apache.org) −

wget https://downloads.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz

Once the download is complete, extract the archive with the following command −

tar -zxvf hadoop-3.3.1.tar.gz

Step 3: Configure Hadoop

After extracting the archive, you need to configure Hadoop. Navigate to the Hadoop installation directory with the following command −

cd hadoop-3.3.1

Next, open the etc/hadoop/hadoop-env.sh file in a text editor −

sudo nano etc/hadoop/hadoop-env.sh

Find the following line −

# export JAVA_HOME=

Uncomment it and set the path to your Java installation directory. On CentOS 8 the OpenJDK 11 package installs under /usr/lib/jvm, so the line will look similar to the following (verify the exact path on your system) −

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk

Save and close the file.
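
If you are not sure where the JDK is installed, the following command resolves the installation directory from the java binary on the PATH; the path it prints is the value JAVA_HOME should be set to −

dirname $(dirname $(readlink -f $(which java)))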

Step 4: Configure Hadoop Core

Next, open the etc/hadoop/core-site.xml file in a text editor −

sudo nano etc/hadoop/core-site.xml

Add the following lines −

<configuration>
   <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>

Save and close the file.

Step 5: Configure Hadoop HDFS

Open the etc/hadoop/hdfs-site.xml file in a text editor −

sudo nano etc/hadoop/hdfs-site.xml

Add the following lines −

<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
   </property>
   <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
   </property>
</configuration>

Save and close the file.

Step 6: Format HDFS

Before starting Hadoop for the first time, you need to format HDFS. Run the following command −

bin/hdfs namenode -format

Step 7: Start Hadoop
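
The start scripts launch each daemon over ssh, even on a single node, so the user running Hadoop normally needs passwordless SSH access to localhost. A minimal sketch, assuming no SSH key exists yet −

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost exit

Answer yes the first time ssh asks about the host key.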

To start Hadoop, run the following command −

sbin/start-all.sh

This script starts the HDFS daemons (NameNode, SecondaryNameNode, and DataNode) and the YARN daemons (ResourceManager and NodeManager). On recent Hadoop releases start-all.sh is deprecated in favour of sbin/start-dfs.sh and sbin/start-yarn.sh, but it still works for a single-node setup.

Step 8: Verify Hadoop Installation

To verify the installation, run the following command −

jps

This should display processes similar to the following (the process IDs will differ on your system) −

1234 NameNode
2345 SecondaryNameNode
5678 DataNode
9101 ResourceManager
3456 NodeManager

Congratulations! You have successfully installed and configured Apache Hadoop on a single node running CentOS 8.

Step 9: Configure Hadoop YARN

YARN (Yet Another Resource Negotiator) is the resource management layer of Hadoop. It enables multiple data processing engines, such as Apache Spark, Apache Hive, and Apache Pig, to run on the same Hadoop cluster.

Open the etc/hadoop/yarn-site.xml file in a text editor −

sudo nano etc/hadoop/yarn-site.xml

Add the following lines −

<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
   <property>
      <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
   </property>
   <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>localhost</value>
   </property>
   <property>
      <name>yarn.nodemanager.local-dirs</name>
      <value>/home/hadoop/hadoopdata/yarn/local</value>
   </property>
   <property>
      <name>yarn.nodemanager.log-dirs</name>
      <value>/home/hadoop/hadoopdata/yarn/logs</value>
   </property>
   <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>1024</value>
   </property>
   <property>
      <name>yarn.nodemanager.resource.cpu-vcores</name>
      <value>1</value>
   </property>
</configuration>

Save and close the file.
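
The steps above do not modify etc/hadoop/mapred-site.xml, but running MapReduce jobs on YARN (for example the test in Step 11) typically requires telling MapReduce to use YARN as its framework. A minimal sketch of etc/hadoop/mapred-site.xml, following the same layout as the other configuration files −

<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
   <property>
      <name>mapreduce.application.classpath</name>
      <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
   </property>
</configuration>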

Step 10: Verify YARN Installation
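
Because yarn-site.xml was edited after the daemons were started in Step 7, restart the YARN daemons first so the new settings take effect −

sbin/stop-yarn.sh
sbin/start-yarn.sh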

Then run the jps command again to verify that the YARN daemons came back up −

jps

This should again display the ResourceManager and NodeManager alongside the HDFS daemons (process IDs will differ) −

1234 NameNode
2345 SecondaryNameNode
5678 DataNode
9101 ResourceManager
3456 NodeManager

Step 11: Test Hadoop

To test Hadoop, you can use the hadoop command-line interface to create a directory in HDFS and upload a file to it.

Create a directory in HDFS −

bin/hadoop fs -mkdir /input

Upload a file to the directory −

bin/hadoop fs -put etc/hadoop/hadoop-env.sh /input

Verify that the file has been uploaded −

bin/hadoop fs -ls /input

This should display output similar to the following (the size and timestamp will differ) −

-rw-r--r--   1 hadoop supergroup       3519 Apr 25 10:00 /input/hadoop-env.sh
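
As a further end-to-end test, you can run one of the MapReduce example jobs that ship with the distribution against the uploaded file (assuming MapReduce has been pointed at YARN as sketched in Step 9) −

bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar wordcount /input /output
bin/hadoop fs -cat /output/part-r-00000

The second command prints the word counts computed by the job.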

Congratulations! You have successfully installed and configured Apache Hadoop on a single node running CentOS 8 and tested its basic functionality. You can now use Hadoop to process and analyze large data sets.

Step 12: Stop Hadoop

To stop Hadoop, run the following command −

sbin/stop-all.sh

This will stop all Hadoop daemons.

Step 13: Create a User for Hadoop

It is recommended to run Hadoop under a dedicated user rather than as root (ideally this user and its home directory are created before the HDFS directories under /home/hadoop used in Step 5). To create the user, run the following command −

sudo useradd hadoop

Set a password for the user −

sudo passwd hadoop

Step 14: Grant Permissions

To give the hadoop user ownership of its data directory, create the directory (if it does not already exist) and change its owner −

sudo mkdir /home/hadoop/hadoopdata
sudo chown hadoop:hadoop /home/hadoop/hadoopdata

Then set permissions on the directory −

sudo chmod 755 /home/hadoop/hadoopdata
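
Step 15 below points HADOOP_HOME at /home/hadoop/hadoop-3.3.1, so if the archive was extracted somewhere else (for example in your own home directory), move it there and transfer ownership to the hadoop user −

sudo mv hadoop-3.3.1 /home/hadoop/
sudo chown -R hadoop:hadoop /home/hadoop/hadoop-3.3.1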

Step 15: Set Environment Variables

To set the environment variables, open the hadoop user's .bashrc file in a text editor −

nano ~/.bashrc

Add the following lines −

export HADOOP_HOME=/home/hadoop/hadoop-3.3.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Save and close the file.

Step 16: Reload Environment Variables

To reload the environment variables, run the following command −

source ~/.bashrc
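
If the variables were loaded correctly, the hadoop command should now resolve from any directory −

hadoop version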

Step 17: Verify Hadoop User

To verify that the hadoop user has been created with its own group, run the following command −

id hadoop

This should display output similar to the following (the uid and gid may differ on your system) −

uid=1001(hadoop) gid=1001(hadoop) groups=1001(hadoop)

Step 18: Restart Hadoop

To restart Hadoop with the new user and permissions in place, run the following command −

sbin/start-all.sh
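
Since Step 15 added the sbin directory to the PATH, the same can also be done from any directory as the hadoop user (assuming the variables from Step 15 were added to that user's ~/.bashrc) −

su - hadoop
start-all.sh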

Congratulations! You have successfully installed and configured Apache Hadoop on a single node running CentOS 8, created a dedicated hadoop user, and granted it permissions on the Hadoop data directory. You can now run Hadoop as the hadoop user to process and analyze large data sets.

Step 19: Configure Firewall

To allow external access to the Hadoop web interfaces, you need to configure the firewall to allow incoming connections on ports 9870 (NameNode UI) and 8088 (ResourceManager UI).

To configure the firewall, run the following commands −

sudo firewall-cmd --zone=public --add-port=8088/tcp --permanent
sudo firewall-cmd --zone=public --add-port=9870/tcp --permanent
sudo firewall-cmd --reload
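
You can confirm that both ports are now open with −

sudo firewall-cmd --zone=public --list-ports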

Step 20: Access Hadoop Web Interface

To access the Hadoop web interface, open a web browser and enter the following URL (replace localhost with the server's IP address when connecting from another machine) −

http://localhost:9870

This will display the Hadoop NameNode web interface. You can also access the Hadoop ResourceManager web interface at −

http://localhost:8088

Congratulations! You have successfully configured the firewall to allow external access to the Hadoop web interfaces and opened them in a browser. You can now use these interfaces to monitor the cluster and its activity.

Step 21: Install Hadoop Ecosystem Tools

To extend the functionality of Hadoop, you can install various ecosystem tools such as Apache Hive, Apache Pig, and Apache Spark. These tools let you perform higher-level data processing and analysis tasks on top of HDFS and YARN.

These tools are not part of the default CentOS 8 repositories; they are usually installed either from a third-party repository that packages them (such as Apache Bigtop) or by downloading the release tarballs from the respective Apache project websites. If a repository that provides them is enabled, they can be installed with dnf. For example, to install Apache Hive, run the following command −

sudo dnf install hive

Similarly, you can install Apache Pig and Apache Spark with the following commands −

sudo dnf install pig
sudo dnf install spark

Step 22: Configure Hadoop Ecosystem Tools

After installing the ecosystem tools, you need to configure them to work with Hadoop.

To configure Apache Hive, open the hive-site.xml file, located in the /etc/hive/conf directory for a packaged install (or in the conf directory of the Hive installation for a tarball install), in a text editor −

sudo nano /etc/hive/conf/hive-site.xml

Add the following lines −

<configuration>
   <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
      <description>JDBC connect string for a JDBC metastore</description>
   </property>
   <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
      <description>Driver class name for a JDBC metastore</description>
   </property>
   <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hive</value>
      <description>username to use against metastore database</description>
   </property>
   <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>password</value>
      <description>password to use against metastore database</description>
   </property>
</configuration>

Save and close the file.
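
Note that a MySQL-backed metastore also requires a running MySQL/MariaDB server with a hive database user matching the credentials above, and the MySQL JDBC driver jar on Hive's classpath. Once those are in place, the metastore schema is usually initialized once with Hive's schematool −

schematool -dbType mysql -initSchema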

To configure Apache Pig, open the pig.properties file located in the /etc/pig directory (for a packaged install) in a text editor −

sudo nano /etc/pig/pig.properties

Add the following line so that Pig points at the same HDFS instance (fs.default.name is the legacy name of the fs.defaultFS property) −

fs.default.name=hdfs://localhost:9000

Save and close the file.

To configure Apache Spark, open the spark-env.sh file located in the conf directory of the Spark installation in a text editor (if only spark-env.sh.template exists, copy it to spark-env.sh first) −

sudo nano /path/to/spark/conf/spark-env.sh

Add the following lines −

export HADOOP_CONF_DIR=/home/hadoop/hadoop-3.3.1/etc/hadoop
export SPARK_MASTER_HOST=localhost

Save and close the file.

Step 23: Verify Hadoop Ecosystem Tools

To verify the installation and configuration of the ecosystem tools, you can run their respective command-line interfaces. For example, to start the Apache Hive command-line interface, run the following command −

hive

This should display the Hive shell prompt. Similarly, you can start the Apache Pig and Apache Spark command-line interfaces with the following commands −

pig
spark-shell

Congratulations! You have successfully installed and configured several Hadoop ecosystem tools and verified that they start. You can now use these tools to perform a variety of data processing and analysis tasks.

Conclusion

In this tutorial, you have successfully installed and configured Apache Hadoop on a single node in CentOS 8. You can now begin experimenting with Hadoop and running MapReduce jobs on your local machine. As you become more comfortable with Hadoop, you may want to expand to a multi-node cluster for better performance and fault tolerance.
