Apache Pig - TOKENIZE()



The TOKENIZE() function of Pig Latin splits a string (one that contains a group of words) held in a single tuple and returns a bag containing the output of the split operation.

Syntax

Given below is the syntax of the TOKENIZE() function.

grunt> TOKENIZE(expression [, 'field_delimiter']) 

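If no field delimiter is passed, TOKENIZE() splits on the following default delimiters: space [ ], double quote [ " ], comma [ , ], parenthesis [ () ], star [ * ]. To split on a different character, pass it as the optional 'field_delimiter' argument.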
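
The two call forms in the syntax above can be sketched as follows. The relation lines and its chararray field line are hypothetical names used only for illustration, and the hyphen is an arbitrary choice of delimiter.

```pig
-- 'lines' and 'line' are hypothetical; substitute your own relation and field.
-- One argument: no field delimiter is supplied.
grunt> default_tokens = FOREACH lines GENERATE TOKENIZE(line);
-- Two arguments: split on the supplied delimiter (here a hyphen).
grunt> hyphen_tokens = FOREACH lines GENERATE TOKENIZE(line, '-');
```

In both cases the result is a bag of single-field tuples, one tuple per token.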

Example

Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below. This file contains student details such as id, name, age, and city. Note that each name consists of a first name and a last name separated by a space [ ].

student_details.txt

001,Rajiv Reddy,21,Hyderabad
002,siddarth Battacharya,22,Kolkata 
003,Rajesh Khanna,22,Delhi 
004,Preethi Agarwal,21,Pune 
005,Trupthi Mohanthy,23,Bhuwaneshwar 
006,Archana Mishra,23,Chennai 
007,Komal Nayak,24,trivendram 
008,Bharathi Nambiayar,24,Chennai 

We have loaded this file into Pig with the relation name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int,  city:chararray);

Tokenizing a String

We can use the TOKENIZE() function to split a string. As an example, let us split the name field of each student using this function.

grunt> student_name_tokenize = FOREACH student_details GENERATE TOKENIZE(name);

Verification

Verify the relation student_name_tokenize using the DUMP operator as shown below.

grunt> Dump student_name_tokenize;

Output

It will produce the following output, displaying the contents of the relation student_name_tokenize.

({(Rajiv),(Reddy)})
({(siddarth),(Battacharya)})
({(Rajesh),(Khanna)})
({(Preethi),(Agarwal)})
({(Trupthi),(Mohanthy)})
({(Archana),(Mishra)})
({(Komal),(Nayak)})
({(Bharathi),(Nambiayar)})
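
Because TOKENIZE() returns a bag, a common follow-up is to unnest it with the FLATTEN operator so that each token becomes its own row. The sketch below assumes the student_details relation loaded above; the relation name student_name_words and the alias word are illustrative and not part of the original example.

```pig
-- Keep the id and emit one row per token of the name.
-- 'student_name_words' and 'word' are illustrative names.
grunt> student_name_words = FOREACH student_details GENERATE id, FLATTEN(TOKENIZE(name)) AS word;
grunt> DUMP student_name_words;
```

Each student should then appear once per name part: one row carrying the first name and one carrying the last name.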

Other Delimiters

Besides space [ ], the TOKENIZE() function also treats double quote [ " ], comma [ , ], parenthesis [ () ], and star [ * ] as delimiters.

Example

Suppose there is a file named details.txt containing student details such as id, name, age, and city. In the name column, the first name and the last name of each student are separated by various delimiters, as shown below.

details.txt

001,"siddarth""Battacharya",22,Kolkata 
002,Rajesh*Khanna,22,Delhi 
003,(Preethi)(Agarwal),21,Pune 

We have loaded this file into Pig with the relation name details as shown below.

grunt> details = LOAD 'hdfs://localhost:9000/pig_data/details.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int,  city:chararray);

Now, try to separate the first name and the last name of the students using TOKENIZE() as follows.

grunt> tokenize_data = FOREACH details GENERATE TOKENIZE(name);

On verifying the tokenize_data relation using the DUMP operator, you will get the following result.

grunt> Dump tokenize_data;

({(siddarth),(Battacharya)})
({(Rajesh),(Khanna)})
({(Preethi),(Agarwal)})
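
To restrict splitting to a single character, the optional second argument from the syntax section can be supplied. The following is a sketch against the details relation above; the relation name star_only is illustrative.

```pig
-- Split only on the star; with an explicit delimiter, quotes and
-- parentheses are expected to stay inside the tokens.
-- 'star_only' is an illustrative relation name.
grunt> star_only = FOREACH details GENERATE TOKENIZE(name, '*');
grunt> DUMP star_only;
```

With this call, only Rajesh*Khanna should split into two tokens, while the other two names should each come back as a single token with their quotes or parentheses intact.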