Tokenize Text Using NLTK in Python
Tokenization is the process of breaking text down into individual pieces called tokens. In NLTK, tokenization converts a string into a list of tokens, making it easier to process text word by word instead of character by character.
For example, given the input string:
Hi man, how have you been?
We should get the output:
['Hi', 'man', ',', 'how', 'have', 'you', 'been', '?']
Basic Word Tokenization
NLTK provides the word_tokenize() function to split text into words and punctuation marks:
from nltk.tokenize import word_tokenize

my_sent = "Hi man, how have you been?"
tokens = word_tokenize(my_sent)
print(tokens)
['Hi', 'man', ',', 'how', 'have', 'you', 'been', '?']
Sentence Tokenization
You can also tokenize text into sentences using sent_tokenize():
from nltk.tokenize import sent_tokenize
text = "Hi man, how have you been? I hope you are doing well. Let's meet soon!"
sentences = sent_tokenize(text)
for sentence in sentences:
print(sentence)
Hi man, how have you been?
I hope you are doing well.
Let's meet soon!
Filtering Punctuation
To get only words without punctuation, you can filter the tokens:
from nltk.tokenize import word_tokenize
import string

my_sent = "Hi man, how have you been?"
tokens = word_tokenize(my_sent)

# Filter out punctuation
words_only = [token for token in tokens if token not in string.punctuation]
print(words_only)
['Hi', 'man', 'how', 'have', 'you', 'been']
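An alternative is to skip the filtering step entirely and use NLTK's RegexpTokenizer, which only emits tokens that match a pattern, so punctuation never becomes a token in the first place. A minimal sketch:

```python
from nltk.tokenize import RegexpTokenizer

# \w+ matches runs of letters, digits, and underscores, so punctuation
# is simply skipped rather than filtered out afterwards.
tokenizer = RegexpTokenizer(r"\w+")
print(tokenizer.tokenize("Hi man, how have you been?"))
```

One trade-off to be aware of: a \w+ pattern splits contractions, so "Let's" becomes ['Let', 's'], whereas the word_tokenize() approach keeps punctuation-aware splits.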
Common Use Cases
| Function | Purpose | Output Type |
|---|---|---|
| word_tokenize() | Split into words and punctuation | List of tokens |
| sent_tokenize() | Split into sentences | List of sentences |
| Filter punctuation | Get only words | List of words |
Conclusion
NLTK's tokenization functions make it easy to break text into manageable pieces. Use word_tokenize() for word-level processing and sent_tokenize() for sentence-level analysis.
