How to Install and Configure Apache Hadoop on a Single Node in CentOS 8?
Apache Hadoop is an open-source framework that allows for distributed processing of large data sets. It can be installed and configured on a single node, which can be useful for development and testing purposes. In this article, we will discuss how to install and configure Apache Hadoop on a single node running CentOS 8.
Step 1: Install Java
Apache Hadoop requires Java to be installed on the system. To install OpenJDK 11, run the following command −
sudo dnf install java-11-openjdk-devel
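To confirm that Java was installed correctly, you can check the version −
java -version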
Step 2: Install Apache Hadoop
Apache Hadoop can be downloaded from the official Apache website. The latest stable version at the time of writing this article is 3.3.1. You can download it using the following command −
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
Once the download is complete, extract the archive using the following command −
tar -zxvf hadoop-3.3.1.tar.gz
Step 3: Configure Hadoop
After extracting the archive, you need to configure Hadoop. Navigate to the Hadoop installation directory using the following command −
cd hadoop-3.3.1
Next, open the etc/hadoop/hadoop-env.sh file using a text editor −
sudo nano etc/hadoop/hadoop-env.sh
Find the following line −
# export JAVA_HOME=
Uncomment it and set the path to the Java installation directory. On CentOS 8, the OpenJDK 11 package is typically available under /usr/lib/jvm/java-11-openjdk −
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk
Save and close the file.
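If you are unsure of the exact JDK path on your system, one way to discover it (assuming java is already on your PATH) is −
dirname $(dirname $(readlink -f $(which java)))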
Step 4: Configure Hadoop Core
Next, open the etc/hadoop/core-site.xml file using a text editor −
sudo nano etc/hadoop/core-site.xml
Add the following lines −
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
Save and close the file.
Step 5: Configure Hadoop HDFS
Open the etc/hadoop/hdfs-site.xml file using a text editor −
sudo nano etc/hadoop/hdfs-site.xml
Add the following lines −
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>
Save and close the file.
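The dfs.namenode.name.dir and dfs.datanode.data.dir values above assume storage under /home/hadoop/hadoopdata. HDFS will try to create these directories itself when the NameNode is formatted and the DataNode starts, but if you prefer to create them up front you can do so now (ownership is granted to the hadoop user in Step 14) −
sudo mkdir -p /home/hadoop/hadoopdata/hdfs/namenode /home/hadoop/hadoopdata/hdfs/datanode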
Step 6: Format HDFS
Before starting Hadoop, you need to format HDFS. Run the following command −
bin/hdfs namenode -format
Step 7: Start Hadoop
To start Hadoop, run the following command −
sbin/start-all.sh
This will start the NameNode, DataNode, and ResourceManager daemons.
Step 8: Verify Hadoop Installation
To verify the installation, you can run the following command −
jps
This should display processes similar to the following (process IDs will differ) −
1234 NameNode
5678 DataNode
9101 ResourceManager
Congratulations! You have successfully installed and configured Apache Hadoop on a single node running CentOS 8.
Step 9: Configure Hadoop YARN
YARN (Yet Another Resource Negotiator) is the resource management layer of Hadoop that enables multiple data processing engines, such as Apache Spark, Apache Hive, and Apache Pig, to run on the same Hadoop cluster.
Open the etc/hadoop/yarn-site.xml file using a text editor −
sudo nano etc/hadoop/yarn-site.xml
Add the following lines −
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/home/hadoop/hadoopdata/yarn/local</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/home/hadoop/hadoopdata/yarn/logs</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>1</value>
  </property>
</configuration>
Save and close the file.
Step 10: Verify YARN Installation
For the new YARN configuration to take effect, restart the daemons (run sbin/stop-all.sh followed by sbin/start-all.sh), then verify the installation with the following command −
jps
This should display the following processes in addition to the previous ones (process IDs will differ) −
1234 NameNode
5678 DataNode
9101 ResourceManager
3456 NodeManager
Step 11: Test Hadoop
To test Hadoop, you can use the hadoop command-line interface to create a directory in HDFS and upload a file to it.
Create a directory in HDFS −
bin/hadoop fs -mkdir /input
Upload a file to the directory −
bin/hadoop fs -put etc/hadoop/hadoop-env.sh /input
Verify that the file has been uploaded −
bin/hadoop fs -ls /input
This should display output similar to the following −
-rw-r--r-- 1 hadoop supergroup 3519 Apr 25 10:00 /input/hadoop-env.sh
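As an additional check, you can run one of the example MapReduce jobs bundled with the release (the jar path below assumes the standard 3.3.1 distribution layout) and inspect its output −
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar wordcount /input /output
bin/hadoop fs -cat /output/part-r-00000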
Congratulations! You have successfully installed and configured Apache Hadoop on a single node running CentOS 8 and tested its functionality. You can now use Hadoop to process and analyze large data sets.
Step 12: Stop Hadoop
To stop Hadoop, run the following command −
sbin/stop-all.sh
This will stop all Hadoop daemons.
Step 13: Create a User for Hadoop
It is recommended to create a separate user for running Hadoop. To create the user, run the following command −
sudo useradd hadoop
Set a password for the user −
sudo passwd hadoop
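If you intend to run the remaining Hadoop commands as this user, you can switch to it with −
su - hadoop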
Step 14: Grant Permissions
To grant permissions to the Hadoop user, create the Hadoop data directory and give the user ownership of it −
sudo mkdir -p /home/hadoop/hadoopdata
sudo chown -R hadoop:hadoop /home/hadoop/hadoopdata
Grant permissions on the Hadoop directory −
sudo chmod 755 /home/hadoop/hadoopdata
Step 15: Set Environment Variables
To set the environment variables, open the .bashrc file using a text editor −
nano ~/.bashrc
Add the following lines (adjust HADOOP_HOME if you extracted the Hadoop archive to a different location) −
export HADOOP_HOME=/home/hadoop/hadoop-3.3.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Save and close the file.
Step 16: Reload Environment Variables
To reload the environment variables, run the following command −
source ~/.bashrc
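With the PATH updated, you can confirm that the hadoop command is available −
hadoop version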
Step 17: Verify Hadoop User
To verify that the Hadoop user has been created correctly, run the following command −
id hadoop
This should display output similar to the following −
uid=1001(hadoop) gid=1001(hadoop) groups=1001(hadoop)
Step 18: Restart Hadoop
To restart Hadoop with the new user and permissions in place, run the following command −
sbin/start-all.sh
Congratulations! You have successfully installed and configured Apache Hadoop on a single node running CentOS 8, created a user for Hadoop, and granted permissions on the Hadoop directory. You can now use Hadoop with the hadoop user to process and analyze large data sets.
Step 19: Configure Firewall
To allow external access to the Hadoop web interfaces, you need to configure the firewall to allow incoming connections on ports 8088 and 9870.
To configure the firewall, run the following commands −
sudo firewall-cmd --zone=public --add-port=8088/tcp --permanent
sudo firewall-cmd --zone=public --add-port=9870/tcp --permanent
sudo firewall-cmd --reload
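You can confirm that the ports were opened −
sudo firewall-cmd --zone=public --list-ports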
Step 20: Access Hadoop Web Interface
To access the Hadoop web interface, open a web browser and enter the following URL −
http://localhost:9870
This will display the Hadoop NameNode web interface. (When connecting from another machine, replace localhost with the server's IP address or hostname.) You can also access the Hadoop ResourceManager web interface at −
http://localhost:8088
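For a quick check from the server itself that the NameNode UI is responding (assuming curl is installed), you can request its headers −
curl -sI http://localhost:9870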
Congratulations! You have successfully configured the firewall to allow external access to the Hadoop web interfaces. You can now use them to monitor the Hadoop cluster and its activities.
Step 21: Install Hadoop Ecosystem Tools
To extend the functionality of Hadoop, you can install ecosystem tools such as Apache Hive, Apache Pig, and Apache Spark. These tools allow you to perform various data processing and analysis tasks.
To install these tools, you can use the package manager provided by CentOS, provided a repository that ships them (such as a vendor or Apache Bigtop repository) is enabled; the default CentOS 8 repositories do not include these packages, in which case you can download the release tarballs from the Apache website instead. For example, to install Apache Hive via the package manager, run the following command −
sudo dnf install hive
Similarly, you can install Apache Pig and Apache Spark using the following commands −
sudo dnf install pig
sudo dnf install spark
Step 22: Configure Hadoop Ecosystem Tools
After installing the ecosystem tools, you need to configure them to work with Hadoop.
To configure Apache Hive, open the hive-site.xml file located in the /etc/hive/conf directory using a text editor −
sudo nano /etc/hive/conf/hive-site.xml
Add the following lines −
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>password</value>
    <description>password to use against metastore database</description>
  </property>
</configuration>
Save and close the file.
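The configuration above assumes a local MySQL server with a hive database and a hive user whose password matches the value in hive-site.xml; these names and the password are placeholders. A minimal sketch of creating them, run against your MySQL root account, would be −
mysql -u root -p -e "CREATE DATABASE IF NOT EXISTS hive; CREATE USER IF NOT EXISTS 'hive'@'localhost' IDENTIFIED BY 'password'; GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'localhost'; FLUSH PRIVILEGES;"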
To configure Apache Pig, open the pig.properties file located in the /etc/pig directory using a text editor −
sudo nano /etc/pig/pig.properties
Add the following line −
fs.default.name=hdfs://localhost:9000
Save and close the file.
To configure Apache Spark, open the spark-env.sh file located in the conf directory of the Spark installation using a text editor −
sudo nano /path/to/spark/conf/spark-env.sh
Add the following lines −
export HADOOP_CONF_DIR=/home/hadoop/hadoop-3.3.1/etc/hadoop
export SPARK_MASTER_HOST=localhost
Save and close the file.
Step 23: Verify Hadoop Ecosystem Tools
To verify the installation and configuration of the ecosystem tools, you can run their respective command-line interfaces. For example, to start the Apache Hive command-line interface, run the following command −
hive
This should display the Hive shell prompt. Similarly, you can start the Apache Pig and Apache Spark command-line interfaces using the following commands −
pig
spark-shell
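For a quick non-interactive check that Hive can reach its metastore, you can also run a single statement from the command line −
hive -e "SHOW DATABASES;"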
Congratulations! You have successfully installed and configured various Hadoop ecosystem tools and verified their functionality. You can now use these tools to perform various data processing and analysis tasks.
Conclusion
In this tutorial, you have successfully installed and configured Apache Hadoop on a single node in CentOS 8. You can now begin experimenting with Hadoop and running MapReduce jobs on your local machine. As you become more comfortable with Hadoop, you may want to expand to a multi-node cluster for better performance and fault tolerance.