DynamoDB - MapReduce

Amazon's Elastic MapReduce (EMR) allows you to process large data sets quickly and efficiently. EMR runs Apache Hadoop on EC2 instances, but simplifies cluster setup and management. You can use Apache Hive to query data in job flows through HiveQL, a query language resembling SQL. Hive turns those queries into Hadoop jobs, so you do not have to write MapReduce code by hand.

You can use the EMR tab of the management console, the EMR CLI, an API, or an SDK to launch a job flow. You also have the option to run Hive interactively or utilize a script.

EMR read and write operations consume the table's provisioned throughput. When large requests exceed it, EMR retries them under the protection of a backoff algorithm. Running EMR concurrently with other operations and tasks may also result in throttling.
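The retry behavior described above can be illustrated with a short sketch. This is not EMR's actual implementation; it only shows the general idea of exponential backoff with random jitter. The names ThrottledError and call_with_backoff, and all the default delay values, are assumptions made for the example.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a 'throughput exceeded' error."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.05,
                      max_delay=2.0, sleep=time.sleep):
    """Call operation(), retrying on ThrottledError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            # Double the wait ceiling on each retry, capped at max_delay,
            # then wait a random time below it so that concurrent clients
            # do not all retry at the same moment.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
```

The sleep parameter exists only so the waiting can be replaced in tests; in normal use you would pass just the operation, for example call_with_backoff(lambda: write_batch(items)), where write_batch is whatever function performs your request.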

The DynamoDB/EMR integration does not support binary and binary set attributes.
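Before pointing EMR at a table, it can help to check whether sample items contain the unsupported types. The following is a minimal sketch, assuming items are in DynamoDB's typed JSON form, where binary values carry the type code B and binary sets carry BS. The function name and the sample item are made up for the example, and only top-level attributes are inspected.

```python
UNSUPPORTED_TYPES = {"B", "BS"}  # DynamoDB type codes: binary, binary set


def unsupported_attributes(item):
    """Return the sorted names of top-level attributes stored as B or BS.

    `item` is a dict in DynamoDB's typed JSON form,
    e.g. {"Id": {"N": "1"}, "Thumb": {"B": "aGk="}}.
    """
    return sorted(
        name
        for name, typed_value in item.items()
        if UNSUPPORTED_TYPES & set(typed_value)  # set(dict) yields its keys
    )


# Hypothetical sample item for illustration only.
sample = {
    "Id": {"N": "1"},
    "Title": {"S": "Report"},
    "Thumb": {"B": "aGk="},
    "Chunks": {"BS": ["aGk=", "eW8="]},
}
print(unsupported_attributes(sample))
```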

DynamoDB/EMR Integration Prerequisites

Review this checklist of necessary items before using EMR −

An AWS account
A populated table under the same account employed in EMR operations
A custom Hive version with DynamoDB connectivity
DynamoDB connectivity support
An S3 bucket (optional)
An SSH client (optional)
An EC2 key pair (optional)

Hive Setup

Before using EMR, create a key pair to run Hive in interactive mode. The key pair allows connection to EC2 instances and master nodes of job flows.

Create the key pair with the following steps −

Log in to the management console, and open the EC2 console located at https://console.aws.amazon.com/ec2/
Select a region in the upper right-hand corner of the console. Ensure the region matches the region of your DynamoDB table.
In the Navigation pane, select Key Pairs.
Select Create Key Pair.
In the Key Pair Name field, enter a name and select Create.
Download the resulting private key file which uses the following format: filename.pem.

Note − You cannot connect to EC2 instances without the key pair.
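The console steps above can also be scripted. The sketch below assumes the boto3 SDK and its EC2 client method create_key_pair, which returns the private key text under the KeyMaterial key. The helper name, the key name, and the region shown in the usage comment are assumptions for the example.

```python
import os


def create_and_save_key_pair(ec2_client, key_name, directory="."):
    """Create an EC2 key pair and write its private key to <key_name>.pem.

    `ec2_client` is expected to behave like boto3.client("ec2").
    """
    response = ec2_client.create_key_pair(KeyName=key_name)
    path = os.path.join(directory, key_name + ".pem")
    # O_EXCL refuses to overwrite an existing key file; mode 0o600 keeps
    # the private key readable and writable by its owner only.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(response["KeyMaterial"])
    return path


# Usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-west-2")
#   print(create_and_save_key_pair(ec2, "my-emr-key"))
```

The private key is only returned once, at creation time, so the file this writes is the only copy.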

Hive Cluster

Create a Hive-enabled cluster to run Hive. The cluster builds the required environment of applications and infrastructure for a Hive-to-DynamoDB connection.

You can perform this task by using the following steps −

Access the EMR console.
Select Create Cluster.
In the creation screen, set the cluster configuration: enter a descriptive name for the cluster, select Yes for Termination protection, select Enabled for Logging, enter an S3 destination for Log folder S3 location, and select Enabled for Debugging.
In the Software Configuration screen, ensure the fields hold Amazon for Hadoop distribution, the latest version for AMI version, a default Hive version for Applications to be Installed-Hive, and a default Pig version for Applications to be Installed-Pig.
In the Hardware Configuration screen, ensure the fields hold Launch into EC2-Classic for Network and No Preference for EC2 Availability Zone. For Master, keep the default Amazon EC2 Instance Type and leave Request Spot Instances unchecked. For Core, keep the default Amazon EC2 Instance Type, set Count to 2, and leave Request Spot Instances unchecked. For Task, keep the default Amazon EC2 Instance Type, set Count to 0, and leave Request Spot Instances unchecked.

Be sure the instance limit on your account provides sufficient capacity for these nodes; otherwise, the cluster may fail to start.

In the Security and Access screen, ensure fields hold your key pair in EC2 key pair, No other IAM users in IAM user access, and Proceed without roles in IAM role.
Review the Bootstrap Actions screen, but do not modify it.
Review settings, and select Create Cluster when finished.

A Summary pane appears when the cluster starts.
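For readers who prefer the SDK over the console, the settings above map roughly onto a request for the boto3 EMR client method run_job_flow. The sketch below only builds the request dictionary and does not contact AWS. The function name is made up, and the release label and instance type are left as required arguments because the console steps rely on defaults this example cannot know. Debugging and IAM roles are omitted to mirror the steps above; depending on the account, JobFlowRole and ServiceRole may also be needed.

```python
def build_cluster_request(name, log_uri, key_pair, release_label,
                          instance_type):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    Mirrors the console walkthrough: Hive and Pig installed, logging to
    S3, termination protection on, one master node and two core nodes,
    no task nodes, and on-demand (not Spot) instances.
    """
    def group(role, count):
        return {
            "Name": role.title(),
            "InstanceRole": role,          # "MASTER" or "CORE"
            "Market": "ON_DEMAND",         # i.e. Request Spot Instances off
            "InstanceType": instance_type,
            "InstanceCount": count,
        }

    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Hive"}, {"Name": "Pig"}],
        "Instances": {
            "Ec2KeyName": key_pair,
            "TerminationProtected": True,
            # Keep the cluster running so an interactive session is possible.
            "KeepJobFlowAliveWhenNoSteps": True,
            "InstanceGroups": [group("MASTER", 1), group("CORE", 2)],
        },
    }


# Usage (requires boto3 and AWS credentials; all values are placeholders):
#   import boto3
#   request = build_cluster_request(
#       "hive-dynamodb-demo", "s3://my-bucket/emr-logs/", "my-emr-key",
#       release_label="emr-6.15.0", instance_type="m5.xlarge")
#   boto3.client("emr").run_job_flow(**request)
```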

Activate SSH Session

You need an active SSH session to connect to the master node and execute CLI operations. Locate the master node by selecting the cluster in the EMR console, which lists it as Master Public DNS Name.

Install PuTTY if you do not have it. Then launch PuTTYgen and select Load. Choose your PEM file, and open it. PuTTYgen will inform you of successful import. Select Save private key to save in PuTTY private key format (PPK), and choose Yes for saving without a pass phrase. Then enter a name for the PuTTY key, hit Save, and close PuTTYgen.

To connect to the master node, start PuTTY. Choose Session from the Category list. Enter hadoop@DNS in the Host Name field, where DNS is the Master Public DNS Name of the cluster. Expand Connection > SSH in the Category list, and choose Auth. In the controlling options screen, select Browse for Private key file for authentication. Then select your private key file (PPK) and open it. Select Open to start the session, and select Yes for the security alert pop-up.
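On systems with an OpenSSH client, the PuTTY steps correspond roughly to a single ssh command that uses the PEM file directly, with no PPK conversion. The helper below is a small sketch that assembles that command; the function name, host name, and key path are placeholders for the example.

```python
import shlex


def ssh_command(master_dns, key_path, user="hadoop"):
    """Return the argument list for an SSH login to the EMR master node."""
    # -i selects the private key file downloaded when creating the key pair.
    return ["ssh", "-i", key_path, f"{user}@{master_dns}"]


# Placeholder values; substitute your Master Public DNS Name and PEM file.
command = ssh_command("master.example.com", "my-emr-key.pem")
print(shlex.join(command))

# To open the session from Python instead of copying the command:
#   import subprocess
#   subprocess.run(command)
```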

When connected to the master node, a Hadoop command prompt appears, which means you can begin an interactive Hive session.

Hive Table

Hive serves as a data warehouse tool allowing queries on EMR clusters using HiveQL. The previous setups give you a working prompt. Run Hive commands interactively by simply entering “hive,” and then any commands you wish. See our Hive tutorial for more information on Hive.
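As an illustration of what you might enter at the Hive prompt, the sketch below generates a HiveQL statement that maps a Hive external table onto a DynamoDB table through the DynamoDB storage handler provided on EMR. The helper name, the table names, and the columns are invented for the example, and the helper does no quoting or validation of the names it is given.

```python
def create_table_statement(hive_table, dynamodb_table, columns):
    """Build a CREATE EXTERNAL TABLE statement backed by a DynamoDB table.

    `columns` is a list of (hive_column, hive_type, dynamodb_attribute).
    """
    col_defs = ", ".join(f"{col} {hive_type}" for col, hive_type, _ in columns)
    # The mapping pairs each Hive column with a DynamoDB attribute name.
    mapping = ",".join(f"{col}:{attr}" for col, _, attr in columns)
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({col_defs})\n"
        "STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'\n"
        f'TBLPROPERTIES ("dynamodb.table.name" = "{dynamodb_table}",\n'
        f'               "dynamodb.column.mapping" = "{mapping}");'
    )


# Hypothetical table and columns for illustration only.
statement = create_table_statement(
    "hive_orders",
    "Orders",
    [("order_id", "string", "OrderId"), ("total", "bigint", "Total")],
)
print(statement)
```

The printed statement can be pasted into the interactive Hive session; after that, ordinary HiveQL queries against hive_orders read from the DynamoDB table and consume its read throughput.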