Apache Solr - On Hadoop
Solr can be used along with Hadoop. As Hadoop handles a large amount of data, Solr helps us in finding the required information from such a large source. In this section, let us understand how you can install Hadoop on your system.
Given below are the steps to be followed to download Hadoop onto your system.
Step 1 − Go to the homepage of Hadoop. You can use the link − www.hadoop.apache.org/. Click the Releases link.
You will be redirected to the Apache Hadoop Releases page, which contains links to mirrors of the source and binary files for various versions of Hadoop.
Step 2 − Select the latest version of Hadoop (in our tutorial, it is 2.6.4) and click its binary link. It will take you to a page where mirrors for Hadoop binary are available. Click one of these mirrors to download Hadoop.
Download Hadoop from Command Prompt
Open a Linux terminal and log in as the super-user.
$ su
password:
Go to the directory where you need to install Hadoop, and save the file there using the link copied earlier, as shown in the following code block.
# cd /usr/local
# wget http://redrockdigimark.com/apachemirror/hadoop/common/hadoop-2.6.4/hadoop-2.6.4.tar.gz
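Because the archive comes from a mirror, it can be worth checking it before extracting. The following is a minimal sketch, assuming a SHA-256 digest for the release is available from the Apache release page and that sha256sum is installed. The helper name verify_sha256 is hypothetical and not part of Hadoop.

```shell
# Sketch: compare a file's SHA-256 digest with an expected value.
# verify_sha256 is a hypothetical helper, not a Hadoop command.
verify_sha256() {
    file=$1
    # Published digests are sometimes upper-case; normalise to lower-case.
    expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    # sha256sum prints "<digest>  <filename>"; keep only the digest.
    actual=$(sha256sum "$file" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file" >&2
        return 1
    fi
}

# Usage (take the digest from the Apache site, not from the mirror):
# verify_sha256 hadoop-2.6.4.tar.gz "<published digest>"
```

The function returns a non-zero status on a mismatch, so it can guard the extraction step in a script.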
After downloading Hadoop, extract it using the following commands.
# tar zxvf hadoop-2.6.4.tar.gz
# mkdir hadoop
# mv hadoop-2.6.4/* hadoop/
# exit
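The extract, mkdir, and mv steps can also be combined. The sketch below assumes a tar implementation that supports --strip-components (GNU tar and bsdtar do). The helper name extract_release is hypothetical.

```shell
# Sketch: unpack a release tarball directly into a target directory.
# extract_release is a hypothetical helper name.
extract_release() {
    archive=$1
    target=$2
    mkdir -p "$target" || return 1
    # --strip-components=1 drops the top-level directory (for example
    # hadoop-2.6.4/), so the files land in $target without a separate mv.
    tar -xzf "$archive" -C "$target" --strip-components=1
}

# Usage:
# extract_release hadoop-2.6.4.tar.gz /usr/local/hadoop
```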
Follow the steps given below to install Hadoop in pseudo-distributed mode.
Step 1: Setting Up Hadoop
You can set the Hadoop environment variables by appending the following commands to the ~/.bashrc file. Note that the shell does not allow spaces around the = sign in an assignment.
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME
Next, apply the changes to the current shell session.
$ source ~/.bashrc
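After sourcing the file, a quick check can confirm that the variables are visible in the current shell. This is a minimal sketch; check_hadoop_env is a hypothetical helper name, and the list of variables is simply the one used in this tutorial.

```shell
# Sketch: report which of the tutorial's Hadoop variables are still empty.
# check_hadoop_env is a hypothetical helper name.
check_hadoop_env() {
    missing=0
    for name in HADOOP_HOME HADOOP_MAPRED_HOME HADOOP_COMMON_HOME \
                HADOOP_HDFS_HOME YARN_HOME HADOOP_COMMON_LIB_NATIVE_DIR \
                HADOOP_INSTALL; do
        # Indirect variable lookup that also works in plain POSIX sh.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "not set: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
# check_hadoop_env && echo "Hadoop environment looks complete"
```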
Step 2: Hadoop Configuration
You can find all the Hadoop configuration files in the location “$HADOOP_HOME/etc/hadoop”. It is required to make changes in those configuration files according to your Hadoop infrastructure.
$ cd $HADOOP_HOME/etc/hadoop
In order to develop Hadoop programs in Java, you have to reset the Java environment variable in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java on your system.
export JAVA_HOME=/usr/local/jdk1.7.0_71
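If you are unsure where Java is installed, the location can often be derived from the java binary on your PATH. The sketch below assumes the usual layout in which the binary lives at <install dir>/bin/java and that readlink -f is available. On some JDK layouts the result points at a jre subdirectory, so treat it as a candidate rather than a definitive answer. The helper name java_home_from is hypothetical.

```shell
# Sketch: derive a JAVA_HOME candidate from the path of a java binary.
# java_home_from is a hypothetical helper name.
java_home_from() {
    # Resolve symlinks such as /usr/bin/java -> .../bin/java.
    real=$(readlink -f "$1") || return 1
    # Strip the trailing /bin/java to get the installation directory.
    dirname "$(dirname "$real")"
}

# Usage:
# java_home_from "$(command -v java)"
```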
The following files have to be edited to configure Hadoop: core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml.
The core-site.xml file contains information such as the port number used for Hadoop instance, memory allocated for the file system, memory limit for storing the data, and size of Read/Write buffers.
Open the core-site.xml and add the following properties inside the <configuration>, </configuration> tags.
<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>
The hdfs-site.xml file contains information such as the replication factor, the namenode path, and the datanode paths on your local file system. These paths determine where Hadoop stores its data.
Let us assume the following data.
dfs.replication (data replication value) = 1

(In the path given below, /hadoop/ is the user name.
hadoopinfra/hdfs/namenode is the directory created by the hdfs file system.)
namenode path = //home/hadoop/hadoopinfra/hdfs/namenode

(hadoopinfra/hdfs/datanode is the directory created by the hdfs file system.)
datanode path = //home/hadoop/hadoopinfra/hdfs/datanode
Open this file and add the following properties inside the <configuration>, </configuration> tags.
<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.name.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
   </property>
   <property>
      <name>dfs.data.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
   </property>
</configuration>
Note − In the above file, all the property values are user-defined and you can make changes according to your Hadoop infrastructure.
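Hadoop typically creates the namenode and datanode directories itself, but creating them beforehand as the Hadoop user can surface permission problems early. The following is a minimal sketch under that assumption. The helper name make_hdfs_dirs is hypothetical, and the base path is the one assumed in this tutorial.

```shell
# Sketch: create the namenode and datanode directories under a base path.
# make_hdfs_dirs is a hypothetical helper name.
make_hdfs_dirs() {
    base=$1
    # -p creates missing parent directories and is harmless if they exist.
    mkdir -p "$base/hdfs/namenode" "$base/hdfs/datanode"
}

# Usage:
# make_hdfs_dirs /home/hadoop/hadoopinfra
```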
The yarn-site.xml file is used to configure YARN in Hadoop. Open the yarn-site.xml file and add the following properties between the <configuration> and </configuration> tags.
<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>
The mapred-site.xml file is used to specify which MapReduce framework we are using. By default, Hadoop contains only a template of this file, named mapred-site.xml.template. First, copy mapred-site.xml.template to mapred-site.xml using the following command.
$ cp mapred-site.xml.template mapred-site.xml
Open mapred-site.xml file and add the following properties inside the <configuration>, </configuration> tags.
<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>
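After editing the four files, it can help to read a property back and confirm the value you expect was saved. The sketch below is a simple text scan with awk, not a real XML parser. It assumes each <name> and <value> element sits on its own line, and the helper name hadoop_property is hypothetical.

```shell
# Sketch: print the <value> that follows <name>NAME</name> in a config file.
# hadoop_property FILE NAME is a hypothetical helper, not a Hadoop command.
hadoop_property() {
    awk -v want="$2" '
        # Remember when the wanted <name> element has been seen.
        index($0, "<name>" want "</name>") { found = 1 }
        # Print the first <value> at or after that point, then stop.
        found && match($0, /<value>[^<]*<\/value>/) {
            # "<value>" is 7 characters and "</value>" is 8.
            print substr($0, RSTART + 7, RLENGTH - 15)
            exit
        }
    ' "$1"
}

# Usage:
# hadoop_property "$HADOOP_HOME/etc/hadoop/mapred-site.xml" mapreduce.framework.name
```

If the property is absent, the function prints nothing.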
Verifying Hadoop Installation
The following steps are used to verify the Hadoop installation.
Step 1: Name Node Setup
Set up the namenode using the command "hdfs namenode -format" as follows.
$ cd ~
$ hdfs namenode -format
The expected result is as follows.
10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/192.168.1.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.6.4
...
...
10/24/14 21:30:56 INFO common.Storage: Storage directory /home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11
************************************************************/
Step 2: Verifying the Hadoop dfs
The start-dfs.sh script is used to start the Hadoop dfs. Executing this command will start your Hadoop file system.
$ start-dfs.sh
The expected output is as follows −
10/24/14 21:37:56
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop-2.6.4/logs/hadoop-hadoop-namenode-localhost.out
localhost: starting datanode, logging to /home/hadoop/hadoop-2.6.4/logs/hadoop-hadoop-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]
Step 3: Verifying the Yarn Script
The start-yarn.sh script is used to start YARN. Executing this command will start your YARN daemons.
$ start-yarn.sh
The expected output is as follows −
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.6.4/logs/yarn-hadoop-resourcemanager-localhost.out
localhost: starting nodemanager, logging to /home/hadoop/hadoop-2.6.4/logs/yarn-hadoop-nodemanager-localhost.out
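Besides reading the startup messages, you can check which Hadoop processes are actually running with jps, the process-listing tool that ships with the JDK. The sketch below reads jps output (one "<pid> <name>" pair per line) and lists the daemons from a pseudo-distributed setup that are absent. The helper name missing_daemons is hypothetical.

```shell
# Sketch: read `jps` output on stdin and print expected daemons that are absent.
# missing_daemons is a hypothetical helper name.
missing_daemons() {
    running=$(cat)
    status=0
    for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # Compare against the whole second field so that "NameNode"
        # does not match "SecondaryNameNode".
        if ! printf '%s\n' "$running" |
            awk -v d="$daemon" '$2 == d { ok = 1 } END { exit !ok }'; then
            echo "$daemon"
            status=1
        fi
    done
    return "$status"
}

# Usage:
# jps | missing_daemons || echo "some daemons are not running"
```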
Step 4: Accessing Hadoop on Browser
The default port number to access Hadoop is 50070. Use the following URL to get Hadoop services on the browser.
http://localhost:50070/
Installing Solr on Hadoop
Follow the steps given below to download and install Solr.
Open the homepage of Apache Solr by clicking the following link − https://lucene.apache.org/solr/
Click the download button. You will be redirected to a page listing various mirrors of Apache Solr. Select a mirror, which takes you to a page where you can download the source and binary files of Apache Solr.
On clicking the binary file, an archive named Solr-6.2.0.tgz will be downloaded to the downloads folder of your system. Extract the contents of the downloaded archive.
Create a folder named Solr in the Hadoop home directory and move the contents of the extracted folder to it, as shown below.
$ mkdir Solr
$ cd Downloads
$ mv Solr-6.2.0/* /home/Hadoop/Solr/
Go to the bin folder of the Solr home directory and verify the installation using the version option, as shown in the following code block. The control script itself is named solr, in lowercase.
$ cd bin/
$ ./solr version
6.2.0
Setting home and path
Open the .bashrc file in a text editor, for example −
[Hadoop@localhost ~]$ vi ~/.bashrc
Now set the home and path directories for Apache Solr as follows −
export SOLR_HOME=/home/Hadoop/Solr
export PATH=$PATH:$SOLR_HOME/bin
Open the terminal and execute the following command −
[Hadoop@localhost Solr]$ source ~/.bashrc
Now, you can execute the commands of Solr from any directory.
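To confirm that the Solr bin directory really is on your PATH, a small check such as the following can be used. This is a minimal sketch; path_contains is a hypothetical helper name, and it ignores a single trailing slash on the directory.

```shell
# Sketch: succeed if DIR (ignoring one trailing slash) is an entry in PATH.
# path_contains is a hypothetical helper name.
path_contains() {
    dir=${1%/}
    # Wrap PATH in colons so every entry is delimited on both sides.
    case ":$PATH:" in
        *":$dir:"* | *":$dir/:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
# path_contains "$SOLR_HOME/bin" && echo "Solr bin directory is on PATH"
```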