Apache Pig - PluckTuple()



After an operation such as join, the resulting relation contains the columns of both input schemas, each prefixed with the name of the relation it came from. The PluckTuple() function lets us tell these columns apart: we define a string prefix, and the function keeps only the columns of a relation whose names begin with that prefix.

Syntax

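Given below is the syntax of the PluckTuple() function. Here, expression1 is the prefix to pluck by, expression2 is the set of fields the function is applied to (usually *), and expression3 is an optional boolean flag that controls whether the columns matching the prefix are kept or excluded.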

DEFINE pluck PluckTuple(expression1) 
DEFINE pluck PluckTuple(expression1,expression3) 
pluck(expression2)

Example

Assume that we have two files, emp_sales.txt and emp_bonus.txt, in the HDFS directory /pig_data/. The file emp_sales.txt contains the details of the employees of the sales department, and emp_bonus.txt contains the details of the employees who received a bonus.

emp_sales.txt

1,Robin,22,25000,sales 
2,BOB,23,30000,sales 
3,Maya,23,25000,sales 
4,Sara,25,40000,sales 
5,David,23,45000,sales 
6,Maggy,22,35000,sales

emp_bonus.txt

1,Robin,22,25000,sales 
2,Jaya,23,20000,admin 
3,Maya,23,25000,sales 
4,Alia,25,50000,admin 
5,David,23,45000,sales
6,Omar,30,30000,admin

We have loaded these files into Pig as the relations emp_sales and emp_bonus respectively.

grunt> emp_sales = LOAD 'hdfs://localhost:9000/pig_data/emp_sales.txt' USING PigStorage(',')
   as (sno:int, name:chararray, age:int, salary:int, dept:chararray);
	
grunt> emp_bonus = LOAD 'hdfs://localhost:9000/pig_data/emp_bonus.txt' USING PigStorage(',')
   as (sno:int, name:chararray, age:int, salary:int, dept:chararray);

Join these two relations using the join operator as shown below.

grunt> join_data = join emp_sales by sno, emp_bonus by sno;

Verify the relation join_data using the Dump operator.

grunt> Dump join_data;
 
(1,Robin,22,25000,sales,1,Robin,22,25000,sales)
(2,BOB,23,30000,sales,2,Jaya,23,20000,admin)
(3,Maya,23,25000,sales,3,Maya,23,25000,sales)
(4,Sara,25,40000,sales,4,Alia,25,50000,admin) 
(5,David,23,45000,sales,5,David,23,45000,sales) 
(6,Maggy,22,35000,sales,6,Omar,30,30000,admin)
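
PluckTuple() matches against the field names of the joined relation, so it can help to inspect those names first. The following is a minimal sketch written as a standalone script rather than a grunt session, assuming the same files and HDFS paths as above. The relation name names and the field names sales_name and bonus_name are hypothetical, chosen only for illustration.

```pig
-- Load both files with the schema used in this example (paths as given above).
emp_sales = LOAD 'hdfs://localhost:9000/pig_data/emp_sales.txt' USING PigStorage(',')
   AS (sno:int, name:chararray, age:int, salary:int, dept:chararray);
emp_bonus = LOAD 'hdfs://localhost:9000/pig_data/emp_bonus.txt' USING PigStorage(',')
   AS (sno:int, name:chararray, age:int, salary:int, dept:chararray);

-- Join on sno; each output field is named <relation>::<column>.
join_data = JOIN emp_sales BY sno, emp_bonus BY sno;

-- Print the schema to see the emp_sales:: and emp_bonus:: prefixes.
DESCRIBE join_data;

-- The same prefixed names disambiguate columns in an ordinary projection.
names = FOREACH join_data GENERATE emp_sales::name AS sales_name, emp_bonus::name AS bonus_name;
DUMP names;
```

The prefixes printed by DESCRIBE are the strings that a PluckTuple() prefix has to match.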

Using PluckTuple() Function

Now, use the PluckTuple() function to define the prefix by which you want to differentiate the columns.

grunt> DEFINE pluck PluckTuple('emp_sales::');

Filter the columns in the join_data relation as shown below.

grunt> data = foreach join_data generate FLATTEN(pluck(*));

Describe the relation named data as shown below.

grunt> Describe data;

data: {plucked::emp_sales::sno: int, plucked::emp_sales::name: chararray,
   plucked::emp_sales::age: int, plucked::emp_sales::salary: int,
   plucked::emp_sales::dept: chararray}

Since we have defined the prefix as 'emp_sales::', only the columns that came from the emp_sales relation are plucked, and the columns of emp_bonus are dropped. The prefix must match the relation name that join places before each column name. A prefix such as 'a::' would match nothing here, because neither relation is named a.
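
The two-argument form listed in the Syntax section can be sketched in the same way. This is a hedged example, assuming a Pig release whose PluckTuple() accepts the optional flag and the same files and paths as above. The names pluckNegative and bonus_cols are hypothetical. Passing 'false' as the second argument inverts the match, so the columns that do not begin with the prefix are kept.

```pig
-- Load and join the two relations exactly as in the example above.
emp_sales = LOAD 'hdfs://localhost:9000/pig_data/emp_sales.txt' USING PigStorage(',')
   AS (sno:int, name:chararray, age:int, salary:int, dept:chararray);
emp_bonus = LOAD 'hdfs://localhost:9000/pig_data/emp_bonus.txt' USING PigStorage(',')
   AS (sno:int, name:chararray, age:int, salary:int, dept:chararray);
join_data = JOIN emp_sales BY sno, emp_bonus BY sno;

-- Assumption: the second argument 'false' tells PluckTuple to keep the
-- columns that do NOT start with the given prefix.
DEFINE pluckNegative PluckTuple('emp_sales::', 'false');

-- Apply the function to all fields and flatten the resulting tuple.
bonus_cols = FOREACH join_data GENERATE FLATTEN(pluckNegative(*));

-- Inspect which columns survived the negative match.
DESCRIBE bonus_cols;
```

With this join, the fields that remain should be the ones that came from emp_bonus, since every other field begins with emp_sales::.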
