Apache Pig - PluckTuple()
After an operation such as a join, the output schema contains the columns of both input relations, each prefixed with the name of the relation it came from. The PluckTuple() function selects the columns that belong to one of those relations. To use it, first define a string prefix, then apply the function to filter for the columns in a relation whose names begin with that prefix.
Syntax
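Given below is the syntax of the PluckTuple() function.
DEFINE pluck PluckTuple(expression1)
DEFINE pluck PluckTuple(expression1,expression3)
pluck(expression2)
Here, expression1 is the prefix to pluck by, expression2 is the set of fields to apply the function to (usually *), and expression3 is an optional flag, 'true' or 'false', that indicates whether the matching columns are kept or excluded.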
Example
Assume that we have two files, emp_sales.txt and emp_bonus.txt, in the HDFS directory /pig_data/. The emp_sales.txt file contains the details of the employees of the sales department, and the emp_bonus.txt file contains the details of the employees who received a bonus.
emp_sales.txt
1,Robin,22,25000,sales
2,BOB,23,30000,sales
3,Maya,23,25000,sales
4,Sara,25,40000,sales
5,David,23,45000,sales
6,Maggy,22,35000,sales
emp_bonus.txt
1,Robin,22,25000,sales
2,Jaya,23,20000,admin
3,Maya,23,25000,sales
4,Alia,25,50000,admin
5,David,23,45000,sales
6,Omar,30,30000,admin
We have loaded these files into Pig under the relation names emp_sales and emp_bonus, respectively.
grunt> emp_sales = LOAD 'hdfs://localhost:9000/pig_data/emp_sales.txt' USING PigStorage(',')
   as (sno:int, name:chararray, age:int, salary:int, dept:chararray);
grunt> emp_bonus = LOAD 'hdfs://localhost:9000/pig_data/emp_bonus.txt' USING PigStorage(',')
   as (sno:int, name:chararray, age:int, salary:int, dept:chararray);
Join these two relations using the join operator as shown below.
grunt> join_data = join emp_sales by sno, emp_bonus by sno;
Verify the relation join_data using the Dump operator.
grunt> Dump join_data;
(1,Robin,22,25000,sales,1,Robin,22,25000,sales)
(2,BOB,23,30000,sales,2,Jaya,23,20000,admin)
(3,Maya,23,25000,sales,3,Maya,23,25000,sales)
(4,Sara,25,40000,sales,4,Alia,25,50000,admin)
(5,David,23,45000,sales,5,David,23,45000,sales)
(6,Maggy,22,35000,sales,6,Omar,30,30000,admin)
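The Dump output does not show column names, so it can help to inspect the schema of the joined relation before choosing a prefix. The following is a minimal sketch, assuming the join_data relation defined above.

```pig
-- Print the schema of the joined relation. Each column alias should carry
-- the name of the relation it came from (for example emp_sales::sno and
-- emp_bonus::sno); that leading part is what PluckTuple() matches against.
DESCRIBE join_data;
```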
Using PluckTuple() Function
Now, define the prefix by which you want to select the columns using the PluckTuple() function. The prefix must match the relation name that the join attached to the column names; here we pluck the columns that came from emp_sales.
grunt> DEFINE pluck PluckTuple('emp_sales::');
Filter the columns in the join_data relation as shown below.
grunt> data = foreach join_data generate FLATTEN(pluck(*));
Describe the relation named data as shown below.
grunt> Describe data;
data: {plucked::emp_sales::sno: int, plucked::emp_sales::name: chararray, plucked::emp_sales::age: int, plucked::emp_sales::salary: int, plucked::emp_sales::dept: chararray}
Since we have defined the prefix as 'emp_sales::', only the columns of the emp_sales schema are plucked. The columns of the emp_bonus schema, whose names begin with emp_bonus::, do not match the prefix and are dropped.
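To keep the other side of the join instead, the function can be defined with 'false' as a second argument, which excludes the columns that match the prefix rather than keeping them. The following is a minimal sketch, assuming the join_data relation from the example above. The names pluck_bonus and bonus_data are illustrative and are not part of the original example.

```pig
-- Exclude every column whose alias starts with 'emp_sales::'.
-- In join_data this should leave only the emp_bonus columns.
DEFINE pluck_bonus PluckTuple('emp_sales::', 'false');

-- Apply the function to all fields and flatten the resulting tuple.
bonus_data = FOREACH join_data GENERATE FLATTEN(pluck_bonus(*));

-- Inspect the resulting schema to confirm which columns remain.
DESCRIBE bonus_data;
```

With only two joined relations, plucking with the prefix 'emp_bonus::' and no flag should select the same columns. The exclusion form is mainly useful when more than two relations have been joined and only one of them needs to be dropped.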