Tokenize Text Using NLTK in Python

Tokenization is the process of breaking text down into individual pieces called tokens. With NLTK in Python, tokenization converts a string into a list of tokens, making it easier to process text word by word rather than character by character.

For example, given the input string:

Hi man, how have you been?

We should get the output:

['Hi', 'man', ',', 'how', 'have', 'you', 'been', '?']

Basic Word Tokenization

NLTK provides the word_tokenize() function to split text into words and punctuation marks:

import nltk
from nltk.tokenize import word_tokenize

# Download the tokenizer data (only needed once; newer NLTK
# versions may also require the 'punkt_tab' resource)
nltk.download('punkt')

my_sent = "Hi man, how have you been?"
tokens = word_tokenize(my_sent)

print(tokens)
['Hi', 'man', ',', 'how', 'have', 'you', 'been', '?']

Sentence Tokenization

You can also tokenize text into sentences using sent_tokenize():

from nltk.tokenize import sent_tokenize

text = "Hi man, how have you been? I hope you are doing well. Let's meet soon!"
sentences = sent_tokenize(text)

for sentence in sentences:
    print(sentence)
Hi man, how have you been?
I hope you are doing well.
Let's meet soon!

Filtering Punctuation

To get only words without punctuation, you can filter the tokens:

from nltk.tokenize import word_tokenize
import string

my_sent = "Hi man, how have you been?"
tokens = word_tokenize(my_sent)

# Filter out punctuation
words_only = [token for token in tokens if token not in string.punctuation]
print(words_only)
['Hi', 'man', 'how', 'have', 'you', 'been']
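An alternative is NLTK's RegexpTokenizer, which never emits punctuation in the first place because it only matches tokens against a pattern you supply. A sketch using the same sentence:

```python
from nltk.tokenize import RegexpTokenizer

# Match runs of word characters; punctuation is simply never captured.
# No 'punkt' download is needed for this tokenizer.
tokenizer = RegexpTokenizer(r"\w+")
words = tokenizer.tokenize("Hi man, how have you been?")
print(words)
# ['Hi', 'man', 'how', 'have', 'you', 'been']
```

The trade-off is that a pattern like \w+ also splits contractions apart (e.g. "Let's" becomes "Let" and "s"), so the filter-after-word_tokenize approach above is preferable when contractions matter.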

Common Use Cases

Approach              Purpose                            Output Type
word_tokenize()       Split into words and punctuation   List of tokens
sent_tokenize()       Split into sentences               List of sentences
Punctuation filter    Get only words                     List of words

Conclusion

NLTK's tokenization functions make it easy to break text into manageable pieces. Use word_tokenize() for word-level processing and sent_tokenize() for sentence-level analysis.

Updated on: 2026-03-24T20:49:15+05:30
