Apache Pig - DIFF()

The DIFF() function of Pig Latin compares two bags (fields) in a tuple. It takes two fields of a tuple as input. If the bags match, it returns an empty bag. If they do not match, it collects the elements that are present in one bag but not in the other and returns them wrapped in a bag.

Syntax

Given below is the syntax of the DIFF() function.

DIFF (expression, expression)

Example

The DIFF() function is typically used to compare two bags in a tuple. In the example below, we create two relations, cogroup them, and calculate the difference between them.

Assume that we have two files, emp_sales.txt and emp_bonus.txt, in the HDFS directory /pig_data/ as shown below. The file emp_sales.txt contains the details of the employees of the sales department, and emp_bonus.txt contains the details of the employees who received a bonus.

emp_sales.txt

1,Robin,22,25000,sales 
2,BOB,23,30000,sales 
3,Maya,23,25000,sales 
4,Sara,25,40000,sales 
5,David,23,45000,sales 
6,Maggy,22,35000,sales

emp_bonus.txt

1,Robin,22,25000,sales 
2,Jaya,23,20000,admin 
3,Maya,23,25000,sales 
4,Alia,25,50000,admin 
5,David,23,45000,sales 
6,Omar,30,30000,admin

We have loaded these files into Pig under the relation names emp_sales and emp_bonus, respectively.

grunt> emp_sales = LOAD 'hdfs://localhost:9000/pig_data/emp_sales.txt' USING PigStorage(',')
   as (sno:int, name:chararray, age:int, salary:int, dept:chararray);
	
grunt> emp_bonus = LOAD 'hdfs://localhost:9000/pig_data/emp_bonus.txt' USING PigStorage(',')
   as (sno:int, name:chararray, age:int, salary:int, dept:chararray);

Group the records (tuples) of the relations emp_sales and emp_bonus by the key sno, using the COGROUP operator as shown below.

grunt> cogroup_data = COGROUP emp_sales by sno, emp_bonus by sno;

Verify the relation cogroup_data using the DUMP operator as shown below.

grunt> Dump cogroup_data;
  
(1,{(1,Robin,22,25000,sales)},{(1,Robin,22,25000,sales)}) 
(2,{(2,BOB,23,30000,sales)},{(2,Jaya,23,20000,admin)}) 
(3,{(3,Maya,23,25000,sales)},{(3,Maya,23,25000,sales)}) 
(4,{(4,Sara,25,40000,sales)},{(4,Alia,25,50000,admin)}) 
(5,{(5,David,23,45000,sales)},{(5,David,23,45000,sales)}) 
(6,{(6,Maggy,22,35000,sales)},{(6,Omar,30,30000,admin)})
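
To see why DIFF() can refer to emp_sales and emp_bonus as fields, it may help to inspect the schema of the cogrouped relation. A minimal sketch, assuming cogroup_data was created as above: COGROUP produces one field named group (the key) followed by one bag per input relation, each named after that relation.

```pig
-- Print the schema of the cogrouped relation.
-- It should list a 'group' field followed by two bags
-- named emp_sales and emp_bonus.
DESCRIBE cogroup_data;
```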

Calculating the Difference between Two Relations

Let us now calculate the difference between the two relations using the DIFF() function and store it in the relation diff_data as shown below.

grunt> diff_data = FOREACH cogroup_data GENERATE DIFF(emp_sales,emp_bonus);
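
The statement above discards the grouping key, so the output alone does not show which sno each bag belongs to. As a possible variant, the key can be emitted next to the result. The relation name diff_by_sno and the field name differing are illustrative and not part of the original example.

```pig
-- Assumes cogroup_data exists as built with COGROUP above.
-- 'group' is the key field created by COGROUP (here, the sno value).
diff_by_sno = FOREACH cogroup_data GENERATE
    group AS sno,
    DIFF(emp_sales, emp_bonus) AS differing;

-- Each output row now pairs an sno with its bag of differing tuples.
DUMP diff_by_sno;
```

The alias differing is used rather than diff because DIFF appears in Pig's list of reserved keywords.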

Verification

Verify the relation diff_data using the DUMP operator as shown below.

grunt> Dump diff_data;
   
({}) 
({(2,BOB,23,30000,sales),(2,Jaya,23,20000,admin)}) 
({}) 
({(4,Sara,25,40000,sales),(4,Alia,25,50000,admin)}) 
({}) 
({(6,Maggy,22,35000,sales),(6,Omar,30,30000,admin)})

The diff_data relation contains an empty bag for each key whose records in emp_sales and emp_bonus match. For the other keys, it holds the tuples from both relations that differ.

For example, the records with sno 1 are identical in both relations: (1,Robin,22,25000,sales) and (1,Robin,22,25000,sales). Therefore, in the diff_data relation, which is the result of the DIFF() function, you get an empty bag for sno 1.
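
When only the mismatching keys are of interest, the empty bags can be filtered out. The following is a sketch under the assumption that cogroup_data exists as built earlier; keyed_diff, mismatched, and differing are hypothetical names, and IsEmpty is Pig's built-in check for an empty bag.

```pig
-- Keep the key next to the DIFF() result so rows can be identified.
keyed_diff = FOREACH cogroup_data GENERATE
    group AS sno,
    DIFF(emp_sales, emp_bonus) AS differing;

-- Drop the rows whose two bags matched (DIFF() returned an empty bag).
mismatched = FILTER keyed_diff BY NOT IsEmpty(differing);

DUMP mismatched;
```

With the sample data, only the rows for sno 2, 4, and 6 should remain.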
