Apache Solr - On Hadoop



Solr can be used along with Hadoop. While Hadoop handles large volumes of data, Solr helps us find the required information from such a large source. In this section, let us understand how to install Hadoop on your system.

Downloading Hadoop

Given below are the steps to be followed to download Hadoop onto your system.

Step 1 − Go to the homepage of Hadoop. You can use the link − www.hadoop.apache.org/. Click the link Releases, as highlighted in the following screenshot.

Hadoop Home Page

It will redirect you to the Apache Hadoop Releases page which contains links for mirrors of source and binary files of various versions of Hadoop as follows −

Hadoop Releases

Step 2 − Select the latest version of Hadoop (in our tutorial, it is 2.6.4) and click its binary link. It will take you to a page where mirrors for Hadoop binary are available. Click one of these mirrors to download Hadoop.

Download Hadoop from Command Prompt

Open the Linux terminal and log in as the super-user.

$ su 
password: 

Go to the directory where you need to install Hadoop and download the file there, using the mirror link copied earlier, as shown in the following code block.

# cd /usr/local 
# wget http://redrockdigimark.com/apachemirror/hadoop/common/hadoop-
2.6.4/hadoop-2.6.4.tar.gz

After downloading Hadoop, extract it using the following commands.

# tar zxvf hadoop-2.6.4.tar.gz  
# mkdir hadoop 
# mv hadoop-2.6.4/* hadoop/ 
# exit 

Installing Hadoop

Follow the steps given below to install Hadoop in pseudo-distributed mode.
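
Pseudo-distributed mode starts the Hadoop daemons over SSH to localhost, so passwordless SSH should be set up for the Hadoop user first. The following is a minimal sketch, assuming an SSH server is already installed and running on your machine −

$ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa 
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys 
$ chmod 0600 ~/.ssh/authorized_keys 

You can test the setup with ssh localhost; it should log you in without asking for a password.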

Step 1: Setting Up Hadoop

You can set the Hadoop environment variables by appending the following commands to ~/.bashrc file.

export HADOOP_HOME=/usr/local/hadoop 
export HADOOP_MAPRED_HOME=$HADOOP_HOME 
export HADOOP_COMMON_HOME=$HADOOP_HOME 
export HADOOP_HDFS_HOME=$HADOOP_HOME 
export YARN_HOME=$HADOOP_HOME 
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native 
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin 
export HADOOP_INSTALL=$HADOOP_HOME

Next, apply all the changes to the currently running system.

$ source ~/.bashrc
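
To confirm that the new variables are picked up by the shell, you can optionally print HADOOP_HOME and the Hadoop version. This is only a quick check and assumes JAVA_HOME is already set for your shell −

$ echo $HADOOP_HOME 
/usr/local/hadoop 
$ hadoop version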

Step 2: Hadoop Configuration

You can find all the Hadoop configuration files in the location “$HADOOP_HOME/etc/hadoop”. It is required to make changes in those configuration files according to your Hadoop infrastructure.

$ cd $HADOOP_HOME/etc/hadoop

In order to develop Hadoop programs in Java, you have to reset the Java environment variable in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java on your system.

export JAVA_HOME=/usr/local/jdk1.7.0_71
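
If you are not sure where Java is installed on your system, the following one-liner (a sketch assuming the java command is on your PATH) resolves the real location of the java binary; the JAVA_HOME value is the directory two levels above the reported .../bin/java path −

$ readlink -f $(which java)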

The following is the list of files that you have to edit to configure Hadoop −

  • core-site.xml
  • hdfs-site.xml
  • yarn-site.xml
  • mapred-site.xml

core-site.xml

The core-site.xml file contains information such as the port number used for Hadoop instance, memory allocated for the file system, memory limit for storing the data, and size of Read/Write buffers.

Open the core-site.xml file and add the following properties between the <configuration> and </configuration> tags.

<configuration> 
   <property>     
      <name>fs.default.name</name>     
      <value>hdfs://localhost:9000</value>   
   </property> 
</configuration> 

hdfs-site.xml

The hdfs-site.xml file contains information such as the replication value, the namenode path, and the datanode paths of your local file systems, that is, the place where you want to store the Hadoop infrastructure.

Let us assume the following data.

dfs.replication (data replication value) = 1  

(In the paths given below, hadoop is the user name, and 
hadoopinfra/hdfs/namenode is the directory created by the HDFS file system.) 
namenode path = /home/hadoop/hadoopinfra/hdfs/namenode  

(hadoopinfra/hdfs/datanode is the directory created by the HDFS file system.) 
datanode path = /home/hadoop/hadoopinfra/hdfs/datanode

Open this file and add the following properties between the <configuration> and </configuration> tags.

<configuration> 
   <property>     
      <name>dfs.replication</name>     
      <value>1</value>   
   </property>  
   
   <property>     
      <name>dfs.name.dir</name>     
      <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>   
   </property>  
   
   <property>     
      <name>dfs.data.dir</name>     
      <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>   
   </property> 
</configuration> 

Note − In the above file, all the property values are user-defined and you can make changes according to your Hadoop infrastructure.
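
If you prefer to create the namenode and datanode directories up front and control their ownership, you can do so before formatting the file system. A minimal sketch, assuming the user name hadoop from the paths above −

$ mkdir -p /home/hadoop/hadoopinfra/hdfs/namenode 
$ mkdir -p /home/hadoop/hadoopinfra/hdfs/datanode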

yarn-site.xml

This file is used to configure YARN in Hadoop. Open the yarn-site.xml file and add the following properties between the <configuration> and </configuration> tags.

<configuration> 
   <property>     
      <name>yarn.nodemanager.aux-services</name>     
      <value>mapreduce_shuffle</value>   
   </property> 
</configuration> 

mapred-site.xml

This file is used to specify which MapReduce framework we are using. By default, Hadoop contains a template of mapred-site.xml. First of all, copy the file mapred-site.xml.template to mapred-site.xml using the following command.

$ cp mapred-site.xml.template mapred-site.xml

Open the mapred-site.xml file and add the following properties between the <configuration> and </configuration> tags.

<configuration> 
   <property>     
      <name>mapreduce.framework.name</name>     
      <value>yarn</value>   
   </property> 
</configuration> 

Verifying Hadoop Installation

The following steps are used to verify the Hadoop installation.

Step 1: Name Node Setup

Set up the namenode using the command "hdfs namenode -format" as follows.

$ cd ~ 
$ hdfs namenode -format 

The expected result is as follows.

10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************ 
STARTUP_MSG: Starting NameNode 
STARTUP_MSG:   host = localhost/192.168.1.11 
STARTUP_MSG:   args = [-format] 
STARTUP_MSG:   version = 2.6.4 
... 
... 
10/24/14 21:30:56 INFO common.Storage: Storage directory 
/home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted. 
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to retain 1 
images with txid >= 0 
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0 
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************ 
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11 
************************************************************/ 

Step 2: Verifying the Hadoop dfs

The following command is used to start the Hadoop dfs. Executing this command will start your Hadoop file system.

$ start-dfs.sh 

The expected output is as follows −

10/24/14 21:37:56 
Starting namenodes on [localhost] 
localhost: starting namenode, logging to /home/hadoop/hadoop-2.6.4/logs/hadoop-
hadoop-namenode-localhost.out 
localhost: starting datanode, logging to /home/hadoop/hadoop-2.6.4/logs/hadoop-
hadoop-datanode-localhost.out 
Starting secondary namenodes [0.0.0.0] 

Step 3: Verifying the Yarn Script

The following command is used to run the Yarn script. Executing this command will start your Yarn daemons.

$ start-yarn.sh 

The expected output is as follows −

starting yarn daemons 
starting resourcemanager, logging to /home/hadoop/hadoop-2.6.4/logs/yarn-
hadoop-resourcemanager-localhost.out 
localhost: starting nodemanager, logging to /home/hadoop/hadoop-
2.6.4/logs/yarn-hadoop-nodemanager-localhost.out 
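
At this point, you can check which daemons are actually running with the JDK's jps tool (assuming the JDK's bin directory is on your PATH). In a working pseudo-distributed setup, the list should include NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager −

$ jps 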

Step 4: Accessing Hadoop on Browser

The default port number to access the Hadoop NameNode web interface is 50070. Use the following URL to open it in a browser.

http://localhost:50070/

Accessing Hadoop
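
If you are working on a machine without a browser, the same port can be checked from the command line. A quick sketch, assuming curl is installed −

$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070/ 
200 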

Installing Solr on Hadoop

Follow the steps given below to download and install Solr.

Step 1

Open the homepage of Apache Solr by clicking the following link − https://lucene.apache.org/solr/

Solr Home Page

Step 2

Click the download button (highlighted in the above screenshot). On clicking, you will be redirected to the page where you have various mirrors of Apache Solr. Select a mirror and click on it, which will redirect you to a page where you can download the source and binary files of Apache Solr, as shown in the following screenshot.

Apache Mirror

Step 3

On clicking, an archive named Solr-6.2.0.tgz will be downloaded to the Downloads folder of your system. Extract the contents of the downloaded archive.

Step 4

Create a folder named Solr in the Hadoop home directory and move the contents of the extracted folder to it, as shown below.

$ mkdir /home/Hadoop/Solr 
$ cd Downloads 
$ mv Solr-6.2.0/* /home/Hadoop/Solr/ 

Verification

Browse to the bin folder of the Solr home directory and verify the installation using the version option, as shown in the following code block.

$ cd bin/ 
$ ./solr version 
6.2.0 

Setting Home and Path

Open the .bashrc file using the following command −

[Hadoop@localhost ~]$ vi ~/.bashrc 

Now set the home and path directories for Apache Solr as follows −

export SOLR_HOME=/home/Hadoop/Solr  
export PATH=$PATH:$SOLR_HOME/bin/

Open the terminal and execute the following command −

[Hadoop@localhost Solr]$ source ~/.bashrc

Now, you can execute the commands of Solr from any directory.
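
With both Hadoop and Solr in place, you can also start Solr with its index stored in HDFS rather than on the local file system. The following is a minimal sketch, assuming the HDFS NameNode configured earlier is running at hdfs://localhost:9000 and that /solr is the HDFS path you want Solr to use −

$ solr start -Dsolr.directoryFactory=HdfsDirectoryFactory \ 
   -Dsolr.lock.type=hdfs \ 
   -Dsolr.hdfs.home=hdfs://localhost:9000/solr 

The same settings can also be made permanent in solrconfig.xml; see the section on running Solr on HDFS in the Solr Reference Guide for details.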
