Hadoop Interview Questions


Dear readers, these Hadoop Interview Questions have been designed specially to get you acquainted with the nature of questions you may encounter during your interview for the subject of Hadoop. As per my experience good interviewers hardly plan to ask any particular question during your interview, normally questions start with some basic concept of the subject and later they continue based on further discussion and what you answer −

It gives the status of the deamons which run Hadoop cluster. It gives the output mentioning the status of namenode, datanode , secondary namenode, Jobtracker and Task tracker.

Step-1. Click on stop-all.sh and then click on start-all.sh OR

Step-2. Write sudo hdfs (press enter), su-hdfs (press enter), /etc/init.d/ha (press enter) and then /etc/init.d/hadoop-0.20-namenode start (press enter).

The three modes in which Hadoop can be run are −

  1. standalone (local) mode
  2. Pseudo-distributed mode
  3. Fully distributed mode

/etc /init.d specifies where daemons (services) are placed or to see the status of these daemons. It is very LINUX specific, and nothing to do with Hadoop.

It cannot be part of the Hadoop cluster.

When Namenode is down, your cluster is OFF, this is because Namenode is the single point of failure in HDFS.

Big Data is nothing but an assortment of such a huge and complex data that it becomes very tedious to capture, store, process, retrieve and analyze it with the help of on-hand database management tools or traditional data processing techniques.

the three characteristics of Big Data are −


Volume − Facebook generating 500+ terabytes of data per day.

Velocity − Analyzing 2 million records each day to identify the reason for losses.

Variety − images, audio, video, sensor data, log files, etc.  Veracity: biases, noise and abnormality in data

Effective analysis of Big Data provides a lot of business advantage as organizations will learn which areas to focus on and which areas are less important. Big data analysis provides some early key indicators that can prevent the company from a huge loss or help in grasping a great opportunity with open hands! A precise analysis of Big Data helps in decision making! For instance, nowadays people rely so much on Facebook and Twitter before buying any product or service. All thanks to the Big Data explosion.

Everyday a large amount of unstructured data is getting dumped into our machines. The major challenge is not to store large data sets in our systems but to retrieve and analyze the big data in the organizations, that too data present in different machines at different locations. In this situation a necessity for Hadoop arises. Hadoop has the ability to analyze the data present in different machines at different locations very quickly and in a very cost effective way. It uses the concept of MapReduce which enables it to divide the query into small parts and process them in parallel. This is also known as parallel computing. The following link Why Hadoop gives a detailed explanation about why Hadoop is gaining so much popularity!

Traditional RDBMS is used for transactional systems to report and archive the data, whereas Hadoop is an approach to store huge amount of data in the distributed file system and process it. RDBMS will be useful when you want to seek one record from Big data, whereas, Hadoop will be useful when you want Big data in one shot and perform analysis on that later

Suppose you have a file stored in a system, and due to some technical problem that file gets destroyed. Then there is no chance of getting the data back present in that file. To avoid such situations, Hadoop has introduced the feature of fault tolerance in HDFS. In Hadoop, when we store a file, it automatically gets replicated at two other locations also. So even if one or two of the systems collapse, the file is still available on the third system.

HDFS works with commodity hardware (systems with average configurations) that has high chances of getting crashed any time. Thus, to make the entire system highly fault-tolerant, HDFS replicates and stores data in different places. Any data on HDFS gets stored at least 3 different locations. So, even if one of them is corrupted and the other is unavailable for some time for any reason, then data can be accessed from the third one. Hence, there is no chance of losing the data. This replication factor helps us to attain the feature of Hadoop called Fault Tolerant.

No, calculations will be done only on the original data. The master node will know which node exactly has that particular data. In case, if one of the nodes is not responding, it is assumed to be failed. Only then, the required calculation will be done on the second replica.

Namenode is the master node on which job tracker runs and consists of the metadata. It maintains and manages the blocks which are present on the datanodes. It is a high-availability machine and single point of failure in HDFS.

No. Namenode can never be commodity hardware because the entire HDFS rely on it. It is the single point of failure in HDFS. Namenode has to be a high-availability machine.

Datanodes are the slaves which are deployed on each machine and provide the actual storage. These are responsible for serving read and write requests for the clients.

HDFS is more suitable for large amount of data sets in a single file as compared to small amount of data spread across multiple files. This is because Namenode is a very expensive high performance system, so it is not prudent to occupy the space in the Namenode by unnecessary amount of metadata that is generated for multiple small files. So, when there is a large amount of data in a single file, name node will occupy less space. Hence for getting optimized performance, HDFS supports large data sets instead of multiple small files.

Job tracker is a daemon that runs on a namenode for submitting and tracking MapReduce jobs in Hadoop. It assigns the tasks to the different task tracker. In a Hadoop cluster, there will be only one job tracker but many task trackers. It is the single point of failure for Hadoop and MapReduce Service. If the job tracker goes down all the running jobs are halted. It receives heartbeat from task tracker based on which Job tracker decides whether the assigned task is completed or not.

Task tracker is also a daemon that runs on datanodes. Task Trackers manage the execution of individual tasks on slave node. When a client submits a job, the job tracker will initialize the job and divide the work and assign them to different task trackers to perform MapReduce tasks. While performing this action, the task tracker will be simultaneously communicating with job tracker by sending heartbeat. If the job tracker does not receive heartbeat from task tracker within specified time, then it will assume that task tracker has crashed and assign that task to another task tracker in the cluster.

A heartbeat is a signal indicating that it is alive. A datanode sends heartbeat to Namenode and task tracker will send its heart beat to job tracker. If the Namenode or job tracker does not receive heart beat then they will decide that there is some problem in datanode or task tracker is unable to perform the assigned task.

A ‘block’ is the minimum amount of data that can be read or written. In HDFS, the default block size is 64 MB as contrast to the block size of 8192 bytes in Unix/Linux. Files in HDFS are broken down into block-sized chunks, which are stored as independent units. HDFS blocks are large as compared to disk blocks, particularly to minimize the cost of seeks. If a particular file is 50 mb, will the HDFS block still consume 64 mb as the default size? No, not at all! 64 mb is just a unit where the data will be stored. In this particular situation, only 50 mb will be consumed by an HDFS block and 14 mb will be free to store something else. It is the MasterNode that does data allocation in an efficient manner.

A file can be larger than any single disk in the network. There’s nothing that requires the blocks from a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster. Making the unit of abstraction a block rather than a file simplifies the storage subsystem. Blocks provide fault tolerance and availability. To insure against corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate machines (typically three). If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client?

Hadoop has its own way of indexing. Depending upon the block size, once the data is stored, HDFS will keep on storing the last part of the data which will say where the next part of the data will be.

Yes, job tracker and task tracker are present in different machines. The reason is job tracker is a single point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.

The mode of communication is SSH.

Rack is a storage area with all the datanodes put together. These datanodes can be physically located at different places. Rack is a physical collection of datanodes which are stored at a single location. There can be multiple racks in a single location.

The secondary Namenode constantly reads the data from the RAM of the Namenode and writes it into the hard disk or the file system. It is not a substitute to the Namenode, so if the Namenode fails, the entire Hadoop system goes down.

Namenode takes the input and divide it into parts and assign them to data nodes. These datanodes process the tasks assigned to them and make a key-value pair and returns the intermediate output to the Reducer. The reducer collects this key value pairs of all the datanodes and combines them and generates the final output.

Through mapreduce program the file can be read by splitting its blocks when reading. But while writing as the incoming values are not yet known to the system mapreduce cannot be applied and no parallel writing is possible.

Use ‘-distcp’ command to copy,

hadoop fs -setrep -w 2 apache_hadoop/sample.txt

Rack awareness is the way in which the namenode decides how to place blocks based on the rack definitions Hadoop will try to minimize the network traffic between datanodes within the same rack and will only contact remote racks if it has to. The namenode is able to control this due to rack awareness.


hadoop dfsadmin -report

You do not need to shutdown and/or restart the entire cluster in this case.

First, add the new node's DNS name to the conf/slaves file on the master node.

Then log in to the new slave node and execute −

$ cd path/to/hadoop

$ bin/hadoop-daemon.sh start datanode

$ bin/hadoop-daemon.sh start tasktracker

then issuehadoop dfsadmin -refreshNodes and hadoop mradmin -refreshNodes so that the NameNode and JobTracker know of the additional node that has been added.

Hadoop job –kill jobid

No. During safe mode replication of blocks is prohibited. The name-node awaits when all or majority of data-nodes report their blocks.

A file will appear in the name space as soon as it is created. If a writer is writing to a file and another client renames either the file itself or any of its path components, then the original writer will get an IOException either when it finishes writing to the current block or when it closes the file.

Hadoop offers the decommission feature to retire a set of existing data-nodes. The nodes to be retired should be included into the exclude file, and the exclude file name should be specified as a configuration parameter dfs.hosts.exclude.

The decommission process can be terminated at any time by editing the configuration or the exclude files and repeating the -refreshNodes command

Yes. For example, to list all the files which begin with the letter a, you could use the ls command with the * wildcard &minu;

hdfs dfs –ls a*

HDFS supports exclusive writes only.

When the first client contacts the name-node to open the file for writing, the name-node grants a lease to the client to create this file. When the second client tries to open the same file for writing, the name-node will see that the lease for the file is already granted to another client, and will reject the open request for the second client

The namenode does not have any available DataNodes.

The Combiner is a ‘mini-reduce’ process which operates only on data generated by a mapper. The Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers

Hadoop will make 5 splits as follows −

  • - 1 split for 64K files
  • - 2 splits for 65MB files
  • - 2 splits for 127MB files

It will restart the task again on some other TaskTracker and only if the task fails more than four ( the default setting and can be changed) times will it kill the job.

HDFS is not good at handling large number of small files. Because every file, directory and block in HDFS is represented as an object in the namenode’s memory, each of which occupies approx 150 bytes So 10 million files, each using a block, would use about 3 gigabytes of memory. when we go for a billion files the memory requirement in namenode cannot be met.

If a node appears to be running slow, the master node can redundantly execute another instance of the same task and first output will be taken .this process is called as Speculative execution.

Yes, through Technologies like Apache Kafka, Apache Flume, and Apache Spark it is possible to do large-scale streaming.

As more and more files are added the namenode creates large edit logs. Which can substantially delay NameNode startup as the NameNode reapplies all the edits. Checkpointing is a process that takes an fsimage and edit log and compacts them into a new fsimage. This way, instead of replaying a potentially unbounded edit log, the NameNode can load the final in-memory state directly from the fsimage. This is a far more efficient operation and reduces NameNode startup time.

Bootstrap is a sleek, intuitive, and powerful mobile first front-end framework for faster and easier web development. It uses HTML, CSS and Javascript.

Bootstrap can be used as −

  • Mobile first approach − Since Bootstrap 3, the framework consists of Mobile first styles throughout the entire library instead of in separate files.

  • Browser Support − It is supported by all popular browsers.

  • Easy to get started − With just the knowledge of HTML and CSS anyone can get started with Bootstrap. Also the Bootstrap official site has a good documentation.

  • Responsive design − Bootstrap's responsive CSS adjusts to Desktops,Tablets and Mobiles.

  • Provides a clean and uniform solution for building an interface for developers.

  • It contains beautiful and functional built-in components which are easy to customize.

  • It also provides web based customization.

  • And best of all it is an open source.

Bootstrap package includes −

  • Scaffolding − Bootstrap provides a basic structure with Grid System, link styles, background. This is is covered in detail in the section Bootstrap Basic Structure

  • CSS − Bootstrap comes with feature of global CSS settings, fundamental HTML elements styled and enhanced with extensible classes, and an advanced grid system. This is covered in detail in the section Bootstrap with CSS.

  • Components − Bootstrap contains over a dozen reusable components built to provide iconography, dropdowns, navigation, alerts, popovers, and much more. This is covered in detail in the section Layout Components.

  • JavaScript Plugins − Bootstrap contains over a dozen custom jQuery plugins. You can easily include them all, or one by one. This is covered in details in the section Bootstrap Plugins.

  • Customize − You can customize Bootstrap's components, LESS variables, and jQuery plugins to get your very own version.

The Contextual classes allow you to change the background color of your table rows or individual cells.

.activeApplies the hover color to a particular row or cell
.successIndicates a successful or positive action
.warningIndicates a warning that might need attention
.dangerIndicates a dangerous or potentially negative action

Bootstrap includes a responsive, mobile first fluid grid system that appropriately scales up to 12 columns as the device or viewport size increases. It includes predefined classes for easy layout options, as well as powerful mixins for generating more semantic layouts.

Media Queries in Bootstrap allow you to move, show and hide content based on viewport size.

Following is basic structure of Bootstrap grid −

<div class="container">
   <div class="row">
      <div class="col-*-*"></div>
      <div class="col-*-*"></div>      
   <div class="row">...</div>
<div class="container">....

Offsets are a useful feature for more specialized layouts. They can be used to push columns over for more spacing, for example. The .col-xs=* classes don't support offsets, but they are easily replicated by using an empty cell.

You can easily change the order of built-in grid columns with .col-md-push-* and .col-md-pull-* modifier classes where * range from 1 to 11.

Bootstrap 3 allows to make the images responsive by adding a class .img-responsive to the <img> tag. This class applies max-width: 100%; and height: auto; to the image so that it scales nicely to the parent element.

Bootstrap sets a basic global display (background), typography, and link styles −

  • Basic Global display − Sets background-color: #fff; on the <body> element.

  • Typography − Uses the @font-family-base, @font-size-base, and @line-height-base attributes as the typographic base

  • Link styles − Sets the global link color via attribute @link-color and apply link underlines only on :hover.

Bootstrap uses Normalize to establish cross browser consistency.

Normalize.css is a modern, HTML5-ready alternative to CSS resets. It is a small CSS file that provides better cross-browser consistency in the default styling of HTML elements.

To add some emphasis to a paragraph, add class="lead". This will give you larger font size, lighter weight, and a taller line height

Bootstrap supports ordered lists, unordered lists, and definition lists.

  • Ordered lists − An ordered list is a list that falls in some sort of sequential order and is prefaced by numbers.

  • Unordered lists − An unordered list is a list that doesn't have any particular order and is traditionally styled with bullets. If you do not want the bullets to appear then you can remove the styling by using the class .list-unstyled. You can also place all list items on a single line using the class .list-inline.

  • Definition lists − In this type of list, each list item can consist of both the <dt> and the <dd> elements. <dt> stands for definition term, and like a dictionary, this is the term (or phrase) that is being defined. Subsequently, the <dd> is the definition of the <dt>.

    You can make terms and descriptions in <dl> line up side-by-side using class dl-horizontal.

Glyphicons are icon fonts which you can use in your web projects. Glyphicons Halflings are not free and require licensing, however their creator has made them available for Bootstrap projects free of cost.

To use the icons, simply use the following code just about anywhere in your code. Leave a space between the icon and text for proper padding.

<span class="glyphicon glyphicon-search"></span>

The transition plugin provides simple transition effects such as Sliding or fading in modals.

A modal is a child window that is layered over its parent window. Typically, the purpose is to display content from a separate source that can have some interaction without leaving the parent window. Child windows can provide information, interaction, or more.

You can toggle the dropdown plugin's hidden content:

  • Via data attributes: Add data-toggle="dropdown" to a link or button to toggle a dropdown as shown below −

    <div class="dropdown">
      <a data-toggle="dropdown" href="#">Dropdown trigger</a>
      <ul class="dropdown-menu" role="menu" aria-labelledby="dLabel">

    If you need to keep links intact (which is useful if the browser is not enabling JavaScript), use the data-target attribute instead of href="#"

    <div class="dropdown">
      <a id="dLabel" role="button" data-toggle="dropdown" data-target="#" href="/page.html">
        Dropdown <span class="caret"></span>
      <ul class="dropdown-menu" role="menu" aria-labelledby="dLabel">
  • Via JavaScript − To call the dropdown toggle via JavaScript, use the following method:


The Bootstrap carousel is a flexible, responsive way to add a slider to your site. In addition to being responsive, the content is flexible enough to allow images, iframes, videos, or just about any type of content that you might want.

Button groups allow multiple buttons to be stacked together on a single line. This is useful when you want to place items like alignment buttons together.

.btn-group class is used for a basic button group. Wrap a series of buttons with class .btn in .btn-group.

.btn-toolbar helps to combine sets of <div class="btn-group"> into a <div class="btn-toolbar"> for more complex components.

.btn-group-lg, .btn-group-sm, .btn-group-xs classses can be applied to button group instead of resizing each button.

.btn-group-vertical class make a set of buttons appear vertically stacked rather than horizontally.

Input groups are extended Form Controls. Using input groups you can easily prepend and append text or buttons to the text-based inputs.

By adding prepended and appended content to an input field, you can add common elements to the user's input. For example, you can add the dollar symbol, the @ for a Twitter username, or anything else that might be common for your application interface.

To prepend or append elements to a .form-control

  • Wrap it in a <div> with class .input-group

  • As a next step, within that same <div> , place your extra content inside a <span> with class .input-group-addon.

  • Now place this <span> either before or after the <input> element.

To create a tabbed navigation menu −

  • Start with a basic unordered list with the base class of .nav.

  • Add class .nav-tabs.

To create a pills navigation menu −

  • Start with a basic unordered list with the base class of .nav.

  • Add class .nav-pills.

You can stack the pills vertically using the class .nav-stacked along with the classes: .nav, .nav-pills.

The navbar is one of the prominent features of Bootstrap sites. Navbars are responsive 'meta' components that serve as navigation headers for your application or site. Navbars collapse in mobile views and become horizontal as the available viewport width increases. At its core, the navbar includes styling for site names and basic navigation.

To create a default navbar −

  • Add the classes .navbar, .navbar-default to the <nav> tag.

  • Add role="navigation" to the above element, to help with accessibility.

  • Add a header class .navbar-header to the <div> element. Include an <a> element with class navbar-brand. This will give the text a slightly larger size.

  • To add links to the navbar, simply add an unordered list with the classes of .nav, .navbar-nav.

Breadcrumbs are a great way to show hierarchy-based information for a site. In the case of blogs, breadcrumbs can show the dates of publishing, categories, or tags. They indicate the current page's location within a navigational hierarchy.

A breadcrumb in Bootstrap is simply an unordered list with a class of .breadcrumb. The separator is automatically added by CSS (bootstrap.min.css).

.pagination class is uesed to add the pagination on a page.

You can customize links by using .disabled for unclickable links and .active to indicate the current page.

Bootstrap labels are great for offering counts, tips, or other markup for pages. Use class .label to display labels.

Badges are similar to labels; the primary difference is that the corners are more rounded. Badges are mainly used to highlight new or unread items. To use badges just add <span class="badge"> to links, Bootstrap navs, and more.

As the name suggest this component can optionally increase the size of headings and add a lot of margin for landing page content. To use the Jumbotron −

  • Create a container <div> with the class of .jumbotron.

  • In addition to a larger <h1>, the font-weight is reduced to 200px.

The page header is a nice little feature to add appropriate spacing around the headings on a page. This is particularly helpful on a web page where you may have several post titles and need a way to add distinction to each of them. To use a page header, wrap your heading in a <div> with a class of .page-header.

To create thumbnails using Bootstrap −

  • Add an <a> tag with the class of .thumbnail around an image.

  • This adds four pixels of padding and a gray border.

  • On hover, an animated glow outlines the image.

it's possible to add any kind of HTML content like headings, paragraphs, or buttons into thumbnails. Follow the steps below −

  • Change the <a> tag that has a class of .thumbnail to a <div>.

  • Inside of that <div>, you can add anything you need. As this is a <div>, we can use the default span-based naming convention for sizing.

  • If you want to group multiple images, place them in an unordered list, and each list item will be floated to the left.

Bootstrap Alerts provide a way to style messages to the user. They provide contextual feedback messages for typical user actions.

You can add an optional close icon to alert.

You can add a basic alert by creating a wrapper <div> and adding a class of .alert and one of the four contextual classes (e.g., .alert-success, .alert-info, .alert-warning, .alert-danger).

To build a dismissal alert −

  • Add a basic alert by creating a wrapper <div> and adding a class of .alert and one of the four contextual classes (e.g., .alert-success, .alert-info, .alert-warning, .alert-danger).

  • Also add optional .alert-dismissable to the above <div> class.

  • Add a close button.

  • Use the <button> element with the data-dismiss="alert" data attribute.

To create a basic progress bar −

  • Add a <div> with a class of .progress.

  • Next, inside the above <div>, add an empty <div> with a class of .progress-bar.

  • Add a style attribute with the width expressed as a percentage. Say for example, style="60%"; indicates that the progress bar was at 60%.

To create a progress bar with different styles −

  • Add a <div> with a class of .progress.

  • Next, inside the above <div>, add an empty <div> with a class of .progress-bar and class progress-bar-* where * could be success, info, warning, danger.

  • Add a style attribute with the width expressed as a percentage. Say for example, style="60%"; indicates that the progress bar was at 60%.

To create a striped progress bar −

  • Add a <div> with a class of .progress and .progress-striped.

  • Next, inside the above <div>, add an empty <div> with a class of .progress-bar and class progress-bar-* where * could be success, info, warning, danger.

  • Add a style attribute with the width expressed as a percentage. Say for example, style="60%"; indicates that the progress bar was at 60%.

To create an animated progress bar −

  • Add a <div> with a class of .progress and .progress-striped. Also add class .active to .progress-striped.

  • Next, inside the above <div>, add an empty <div> with a class of .progress-bar.

  • Add a style attribute with the width expressed as a percentage. Say for example, style="60%"; indicates that the progress bar was at 60%.

You can even stack multiple progress bars. Place the multiple progress bars into the same .progress to stack them.

These are abstract object styles for building various types of components (like blog comments, Tweets, etc.) that feature a left-aligned or right-aligned image alongside the textual content. The goal of the media object is to make the code for developing these blocks of information drastically shorter.

The goal of media objects (light markup, easy extendability) is achieved by applying classes to some of the simple markup.

This class allows to float a media object (images, video, and audio) to the left or right of a content block.

If you are preparing a list where the items will be part of an unordered list, use a class. useful for comment threads or articles lists.

Panel components are used when you want to put your DOM component in a box. To get a basic panel, just add class .panel to the <div> element. Also add class .panel-default to this element.

here are two ways to add panel heading −

  • Use .panel-heading class to easily add a heading container to your panel.

  • Use any <h1>-<h6> with a .panel-title class to add a pre-styled heading.

You can add footers to panels, by wrapping buttons or secondary text in a <div> containing class .panel-footer.

Use contextual state classes such as, panel-primary, panel-success, panel-info, panel-warning, panel-danger, to make a panel more meaningful to a particular context.

Yes! To get a non-bordered table within a panel, use the class .table within the panel. Suppose there is a <div> containing .panel-body, we add an extra border to the top of the table for separation. If there is no <div> containing .panel-body, then the component moves from panel header to table without interruption.

Yes! You can include list groups within any panel. Create a panel by adding class .panel to the <div> element. Also add class .panel-default to this element. Now within this panel include your list groups.

A well is a container in <div> that causes the content to appear sunken or an inset effect on the page. To create a well, simply wrap the content that you would like to appear in the well with a <div> containing the class of .well.

The Scrollspy (auto updating nav) plugin allows you to target sections of the page based on the scroll position. In its basic implementation, as you scroll, you can add .active classes to the navbar based on the scroll position.

The affix plugin allows a <div> to become affixed to a location on the page. You can also toggle it's pinning on and off using this plugin. A common example of this are social icons. They will start in a location, but as the page hits a certain mark, the <div> will be locked in place and will stop scrolling with the rest of the page.

What is Next ?

Further you can go through your past assignments you have done with the subject and make sure you are able to speak confidently on them. If you are fresher then interviewer does not expect you will answer very complex questions, rather you have to make your basics concepts very strong.

Second it really doesn't matter much if you could not answer few questions but it matters that whatever you answered, you must have answered with confidence. So just feel confident during your interview. We at tutorialspoint wish you best luck to have a good interviewer and all the very best for your future endeavor. Cheers :-)