Apache Pig - SUM()



You can use the SUM() function of Pig Latin to get the total of the numeric values of a column in a single-column bag. While computing the total, the SUM() function ignores the NULL values.

Note

  • To get the global sum value, we need to perform a Group All operation, and calculate the sum value using the SUM() function.

  • To get the sum value of a group, we need to group it using the Group By operator and proceed with the sum function.

Syntax

Given below is the syntax of the SUM() function.

grunt> SUM(expression)

Example

Assume that we have a file named employee.txt in the HDFS directory /pig_data/ as shown below.

employee.txt

1,John,2007-01-24,250  
2,Ram,2007-05-27,220  
3,Jack,2007-05-06,170  
3,Jack,2007-04-06,100 
4,Jill,2007-04-06,220 
5,Zara,2007-06-06,300
5,Zara,2007-02-06,350

And we have loaded this file into Pig with the relation name employee_data as shown below.

grunt> employee_data = LOAD 'hdfs://localhost:9000/pig_data/ employee.txt' USING PigStorage(',')
   as (id:int, name:chararray, workdate:chararray, daily_typing_pages:int);

Calculating the Sum of All GPA

To demonstrate the SUM() function, let’s try to calculate the total number of pages typed daily of all the employees. We can use the Apache Pig’s built-in function SUM() (case sensitive) to calculate the sum of the numerical values. Let us group the relation employee_data using the Group All operator, and store the result in the relation named employee_group as shown below.

grunt> employee_group = Group employee_data all;

It will produce a relation as shown below.

grunt> Dump employee_group;
  
(all,{(5,Zara,2007-02-06,350),
(5,Zara,2007-06-06,300),
(4,Jill,2007-0406,220),
(3,Jack,2007-04-06,100),
(3,Jack,2007-05-06,170),
(2,Ram,2007-0527,220),
(1,John,2007-01-24,250)})

Let us now calculate the global sum of the pages typed daily.

grunt> student_workpages_sum = foreach employee_group Generate 
   (employee_data.name,employee_data.daily_typing_pages),SUM(employee_data.daily_typing_pages);

Verification

Verify the relation student_workpages_sum using the DUMP operator as shown below.

grunt> Dump student_workpages_sum;

Output

It will produce the following output, displaying the contents of the relation student_workpages_sum as follows.

(({ (Zara), (Zara), (Jill) ,(Jack) , (Jack) , (Ram) , (John) }, 
{ (350) , (300) , (220) ,(100) , (170)  ,  (220)  , (250)  }),1610)
apache_pig_eval_functions.htm
Advertisements