Apache Pig - TOKENIZE()
The TOKENIZE() function of Pig Latin splits a string (a group of words) held in a single tuple and returns a bag in which each resulting token is its own tuple.
Syntax
Given below is the syntax of the TOKENIZE() function.
grunt> TOKENIZE(expression [, 'field_delimiter'])
If no field delimiter is passed, TOKENIZE() splits on any of the following default delimiters: space [ ], double quote [ " ], comma [ , ], parenthesis [ () ], and star [ * ].
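To make the splitting rule concrete, here is a minimal Python sketch that emulates it outside Pig. It is not Pig's implementation: the function name `tokenize`, the constant `DEFAULT_DELIMITERS`, and the use of 1-tuples to stand in for Pig tuples are all assumptions made for illustration. The sketch treats each character in the delimiter string as a separate delimiter and produces no empty tokens when delimiters appear back to back.

```python
# Assumed default delimiter set, taken from the list above:
# space, double quote, comma, parentheses, star.
DEFAULT_DELIMITERS = ' ",()*'


def tokenize(text, delimiters=DEFAULT_DELIMITERS):
    """Split text on any single delimiter character.

    Returns a list of 1-tuples, loosely mirroring a Pig bag of tuples.
    Runs of consecutive delimiters do not produce empty tokens.
    """
    tokens = []
    current = []
    for ch in text:
        if ch in delimiters:
            # A delimiter closes the token being built, if there is one.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return [(token,) for token in tokens]


print(tokenize("Rajesh Khanna"))        # [('Rajesh',), ('Khanna',)]
print(tokenize("(Preethi)(Agarwal)"))   # [('Preethi',), ('Agarwal',)]
```

Each element of the returned list plays the role of one tuple in the bag that TOKENIZE() returns.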
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below. This file contains student details such as id, name, age, and city. Note that each name consists of a first name and a last name separated by a space [ ].
student_details.txt
001,Rajiv Reddy,21,Hyderabad
002,siddarth Battacharya,22,Kolkata
003,Rajesh Khanna,22,Delhi
004,Preethi Agarwal,21,Pune
005,Trupthi Mohanthy,23,Bhuwaneshwar
006,Archana Mishra,23 ,Chennai
007,Komal Nayak,24,trivendram
008,Bharathi Nambiayar,24,Chennai
We have loaded this file into Pig with the relation name student_details as shown below.
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, city:chararray);
Tokenizing a String
We can use the TOKENIZE() function to split a string. As an example, let us split the name field using this function as shown below.
grunt> student_name_tokenize = FOREACH student_details GENERATE TOKENIZE(name);
Verification
Verify the relation student_name_tokenize using the DUMP operator as shown below.
grunt> Dump student_name_tokenize;
Output
It will produce the following output, displaying the contents of the relation student_name_tokenize.
({(Rajiv),(Reddy)})
({(siddarth),(Battacharya)})
({(Rajesh),(Khanna)})
({(Preethi),(Agarwal)})
({(Trupthi),(Mohanthy)})
({(Archana),(Mishra)})
({(Komal),(Nayak)})
({(Bharathi),(Nambiayar)})
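The load, tokenize, and dump steps above can be approximated in plain Python to see how each row turns into a bag. This is only a sketch under stated assumptions: `ROWS` holds the first three lines of student_details.txt, and the helper names `tokenize` and `dump_format` are hypothetical, not part of Pig. Splitting each row on a comma stands in for PigStorage(','), and the regular expression stands in for the default delimiter set.

```python
import re

# First three rows of student_details.txt, copied from the example above.
ROWS = [
    "001,Rajiv Reddy,21,Hyderabad",
    "002,siddarth Battacharya,22,Kolkata",
    "003,Rajesh Khanna,22,Delhi",
]


def tokenize(text):
    """Split on the assumed default delimiter characters, dropping empties."""
    return [token for token in re.split(r'[ ",()*]+', text) if token]


def dump_format(tokens):
    """Render tokens the way DUMP prints a bag of single-field tuples."""
    return "({" + ",".join("(" + token + ")" for token in tokens) + "})"


for row in ROWS:
    # Emulates PigStorage(','): the second field is the name column.
    _id, name, _age, _city = row.split(",")
    print(dump_format(tokenize(name)))

# Prints:
# ({(Rajiv),(Reddy)})
# ({(siddarth),(Battacharya)})
# ({(Rajesh),(Khanna)})
```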
Other Delimiters
In addition to the space [ ], the TOKENIZE() function treats double quote [ " ], comma [ , ], parenthesis [ () ], and star [ * ] as delimiters.
Example
Suppose there is a file named details.txt containing student details such as id, name, age, and city. In the name column, the first name and the last name of each student are separated by various delimiters as shown below.
details.txt
001,"siddarth""Battacharya",22,Kolkata
002,Rajesh*Khanna,22,Delhi
003,(Preethi)(Agarwal),21,Pune
We have loaded this file into Pig with the relation name details as shown below.
grunt> details = LOAD 'hdfs://localhost:9000/pig_data/details.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, city:chararray);
Now, try to separate the first name and the last name of the students using TOKENIZE() as follows.
grunt> tokenize_data = FOREACH details GENERATE TOKENIZE(name);
On verifying the tokenize_data relation using the DUMP operator, you will get the following result.
grunt> Dump tokenize_data;
({(siddarth),(Battacharya)})
({(Rajesh),(Khanna)})
({(Preethi),(Agarwal)})
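The optional 'field_delimiter' argument from the syntax changes which characters split the string. The Python sketch below contrasts the default set with a custom delimiter on the three names from details.txt. It rests on an assumption not stated in this tutorial: that a custom delimiter replaces the default set rather than adding to it, and that each of its characters acts as a delimiter. The function name `tokenize` is hypothetical.

```python
import re


def tokenize(text, delimiters=' ",()*'):
    """Split text on any character in delimiters, dropping empty tokens.

    Assumption: a custom delimiter string replaces the default set.
    """
    pattern = "[" + re.escape(delimiters) + "]+"
    return [token for token in re.split(pattern, text) if token]


# Name column values from details.txt.
names = ['"siddarth""Battacharya"', "Rajesh*Khanna", "(Preethi)(Agarwal)"]

for name in names:
    default_tokens = tokenize(name)        # default delimiter set
    star_tokens = tokenize(name, "*")      # star as the only delimiter
    print(default_tokens, star_tokens)
```

Under that assumption, only Rajesh*Khanna is split when the star is the sole delimiter, while the quoted and parenthesized names come back as single tokens.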