Hadoop - Overview



What is Big Data?

You have probably seen data stored in a conventional database management system, for example employee data or customer details. It is easy to capture and store such data in a database management system, and just as easy to process it using various applications or SQL itself.

Big data refers to data sets whose size is beyond the ability of conventional database systems to process. Big Data relates to data creation, storage, retrieval and analysis that is remarkable in terms of volume, velocity, and variety.

Big data challenges

The following are the major challenges with big data:

  • Capturing data

  • Curation

  • Storage

  • Searching

  • Sharing

  • Transfer

  • Analysis

  • Presentation

As of 2012, limits on the size of data sets that are feasible to process in a reasonable amount of time were on the order of exabytes of data.

Big data is being generated by everything around us at all times. Every digital process, social media exchange and stock exchange transaction produces digital data, and systems, sensors and mobile devices transmit it. Big data is arriving from multiple sources at an alarming velocity, volume and variety. To extract meaningful value from big data, you need optimal processing power, analytics capabilities and skills.

According to Gartner, a research firm, by 2015 the demand for data and analytics resources will reach 4.4 million jobs globally, but only one-third of those jobs will be filled. The emerging role of data scientist is meant to fill that skills gap.

Big data technologies

Big data is important because it provides more accurate analyses, which may lead to more concrete decision making and ultimately to greater operational efficiency, cost reductions and reduced risk for the business.

To harness the power of big data you require an infrastructure that can manage and process huge volumes of structured and unstructured data in real time and can protect data privacy and security.

Various technologies for handling big data are available in the market from different vendors, including Amazon, IBM and Microsoft.

When looking into technologies to handle Big Data, we examine the following two classes of technology:

  • Operational Big Data: These are systems such as MongoDB that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored.

    NoSQL Big Data systems are designed to take advantage of new cloud computing architectures that have emerged over the past decade to allow massive computations to be run inexpensively and efficiently. This makes operational Big Data workloads much easier to manage, and cheaper and faster to implement.

    Some NoSQL systems can provide insights into patterns and trends based on real-time data with minimal coding and without the need for data scientists and additional infrastructure.

  • Analytical Big Data: These are systems such as Massively Parallel Processing (MPP) database systems and MapReduce that provide analytical capabilities for retrospective, complex analysis that may touch most or all of the data.

    MapReduce provides a new method of analyzing data that is complementary to the capabilities provided by SQL. A system based on MapReduce can be scaled up from single servers to thousands of high and low end machines.

These two classes of technology are complementary and frequently deployed together.
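To make the MapReduce model mentioned above more concrete, here is a minimal single-process sketch in Python that counts words. It is not Hadoop code and does not use any Hadoop API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample documents are assumptions made for illustration. It only shows the shape of the computation: a map step emits key/value pairs, the pairs are grouped by key, and a reduce step combines the values for each key.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key (hypothetical stand-in for the framework's shuffle)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Illustrative input only; a real job would read from distributed storage.
documents = ["big data is big", "data is everywhere"]
counts = word_count(documents)
print(counts)
```

Because each call to map_phase looks at one document and each call to reduce_phase looks at one key, a real framework can in principle run those calls on different machines, which is what lets this style of processing scale out.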

Overview of Operational vs. Analytical Systems

Characteristic   Operational          Analytical
Latency          1 ms - 100 ms        1 min - 100 min
Concurrency      1000 - 100,000       1 - 10
Access Pattern   Writes and Reads     Reads
Queries          Selective            Unselective
Data Scope       Operational          Retrospective
End User         Customer             Data Scientist
Technology       NoSQL                MapReduce, MPP Database
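The difference between selective and unselective queries can be sketched with a small in-memory example in Python. This is an illustration under stated assumptions, not how any particular NoSQL or MPP product works: the orders data set, its field names, and the helper functions get_order, record_order and total_by_customer are all hypothetical.

```python
# Hypothetical in-memory data set; field names are illustrative only.
orders = [
    {"order_id": 1, "customer": "alice", "amount": 30.0},
    {"order_id": 2, "customer": "bob", "amount": 12.5},
    {"order_id": 3, "customer": "alice", "amount": 7.5},
]

# Operational style: keep an index by key so single records are found quickly.
orders_by_id = {order["order_id"]: order for order in orders}


def get_order(order_id):
    """Selective read: fetch one record by its key."""
    return orders_by_id.get(order_id)


def record_order(order):
    """Write: capture a new record and keep the index up to date."""
    orders.append(order)
    orders_by_id[order["order_id"]] = order


def total_by_customer():
    """Analytical style: unselective read that scans every record."""
    totals = {}
    for order in orders:
        customer = order["customer"]
        totals[customer] = totals.get(customer, 0.0) + order["amount"]
    return totals


record_order({"order_id": 4, "customer": "bob", "amount": 20.0})
print(get_order(2))
print(total_by_customer())
```

The keyed lookup touches one record, so it can serve many concurrent users with low latency, while the aggregation touches all records and is the kind of work usually handed to an analytical system.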

MongoDB with Hadoop

New technologies such as NoSQL, MPP databases and Hadoop are available to address Big Data challenges and to enable the business to deliver new types of products and services. One of the most common ways companies leverage the capabilities of both operational and analytical systems is by integrating a NoSQL database such as MongoDB with Hadoop. The connection is easily made by existing APIs and allows analysts and data scientists to perform complex, retrospective queries for Big Data analysis and insights while maintaining the efficiency and ease-of-use of a NoSQL database.

NoSQL, MPP databases and Hadoop are complementary: NoSQL systems should be used to capture Big Data and provide operational intelligence to users, and MPP databases and Hadoop should be used to provide analytical insight for analysts and data scientists. Together, NoSQL, MPP databases and Hadoop enable businesses to capitalize on Big Data.
