HCatalog - Loader & Storer

The HCatLoader and HCatStorer APIs are used with Pig scripts to read and write data in HCatalog-managed tables. No HCatalog-specific setup is required for these interfaces.

It is better to have some knowledge on Apache Pig scripts to understand this chapter better. For further reference, please go through our Apache Pig tutorial.

HCatloader

HCatLoader is used with Pig scripts to read data from HCatalog-managed tables. Use the following syntax to load data into HDFS using HCatloader.

A = LOAD 'tablename' USING org.apache.HCatalog.pig.HCatLoader();

You must specify the table name in single quotes: LOAD 'tablename'. If you are using a non-default database, then you must specify your input as 'dbname.tablename'.

The Hive metastore lets you create tables without specifying a database. If you created tables this way, then the database name is 'default' and is not required when specifying the table for HCatLoader.

The following table contains the important methods and description of the HCatloader class.

Sr.No.	Method Name & Description
1	public InputFormat<?,?> getInputFormat()throws IOException Read the input format of the loading data using the HCatloader class.
2	public String relativeToAbsolutePath(String location, Path curDir) throws IOException It returns the String format of the Absolute path.
3	public void setLocation(String location, Job job) throws IOException It sets the location where the job can be executed.
4	public Tuple getNext() throws IOException Returns the current tuple (key and value) from the loop.

HCatStorer

HCatStorer is used with Pig scripts to write data to HCatalog-managed tables. Use the following syntax for Storing operation.

A = LOAD ...
B = FOREACH A ...
...
...
my_processed_data = ...

STORE my_processed_data INTO 'tablename' USING org.apache.HCatalog.pig.HCatStorer();

You must specify the table name in single quotes: LOAD 'tablename'. Both the database and the table must be created prior to running your Pig script. If you are using a non-default database, then you must specify your input as 'dbname.tablename'.

The Hive metastore lets you create tables without specifying a database. If you created tables this way, then the database name is 'default' and you do not need to specify the database name in the store statement.

For the USING clause, you can have a string argument that represents key/value pairs for partitions. This is a mandatory argument when you are writing to a partitioned table and the partition column is not in the output column. The values for partition keys should NOT be quoted.

The following table contains the important methods and description of the HCatStorer class.

Sr.No.	Method Name & Description
1	public OutputFormat getOutputFormat() throws IOException Read the output format of the stored data using the HCatStorer class.
2	public void setStoreLocation (String location, Job job) throws IOException Sets the location where to execute this store application.
3	public void storeSchema (ResourceSchema schema, String arg1, Job job) throws IOException Store the schema.
4	public void prepareToWrite (RecordWriter writer) throws IOException It helps to write data into a particular file using RecordWriter.
5	public void putNext (Tuple tuple) throws IOException Writes the tuple data into the file.

Running Pig with HCatalog

Pig does not automatically pick up HCatalog jars. To bring in the necessary jars, you can either use a flag in the Pig command or set the environment variables PIG_CLASSPATH and PIG_OPTS as described below.

To bring in the appropriate jars for working with HCatalog, simply include the following flag −

pig –useHCatalog <Sample pig scripts file>

Setting the CLASSPATH for Execution

Use the following CLASSPATH setting for synchronizing the HCatalog with Apache Pig.

export HADOOP_HOME = <path_to_hadoop_install>
export HIVE_HOME = <path_to_hive_install>
export HCAT_HOME = <path_to_hcat_install>

export PIG_CLASSPATH = $HCAT_HOME/share/HCatalog/HCatalog-core*.jar:\
$HCAT_HOME/share/HCatalog/HCatalog-pig-adapter*.jar:\
$HIVE_HOME/lib/hive-metastore-*.jar:$HIVE_HOME/lib/libthrift-*.jar:\
$HIVE_HOME/lib/hive-exec-*.jar:$HIVE_HOME/lib/libfb303-*.jar:\
$HIVE_HOME/lib/jdo2-api-*-ec.jar:$HIVE_HOME/conf:$HADOOP_HOME/conf:\
$HIVE_HOME/lib/slf4j-api-*.jar

Example

Assume we have a file student_details.txt in HDFS with the following content.

student_details.txt

001, Rajiv,    Reddy,       21, 9848022337, Hyderabad
002, siddarth, Battacharya, 22, 9848022338, Kolkata
003, Rajesh,   Khanna,      22, 9848022339, Delhi
004, Preethi,  Agarwal,     21, 9848022330, Pune
005, Trupthi,  Mohanthy,    23, 9848022336, Bhuwaneshwar
006, Archana,  Mishra,      23, 9848022335, Chennai
007, Komal,    Nayak,       24, 9848022334, trivendram
008, Bharathi, Nambiayar,   24, 9848022333, Chennai

We also have a sample script with the name sample_script.pig, in the same HDFS directory. This file contains statements performing operations and transformations on the student relation, as shown below.

student = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING 
   PigStorage(',') as (id:int, firstname:chararray, lastname:chararray,
   phone:chararray, city:chararray);
	
student_order = ORDER student BY age DESC;
STORE student_order INTO 'student_order_table' USING org.apache.HCatalog.pig.HCatStorer();
student_limit = LIMIT student_order 4;
Dump student_limit;

The first statement of the script will load the data in the file named student_details.txt as a relation named student.
The second statement of the script will arrange the tuples of the relation in descending order, based on age, and store it as student_order.
The third statement stores the processed data student_order results in a separate table named student_order_table.
The fourth statement of the script will store the first four tuples of student_order as student_limit.
Finally the fifth statement will dump the content of the relation student_limit.

Let us now execute the sample_script.pig as shown below.

$./pig -useHCatalog hdfs://localhost:9000/pig_data/sample_script.pig

Now, check your output directory (hdfs: user/tmp/hive) for the output (part_0000, part_0001).