Tokenize Text Using NLTK in Python
Tokenization is the process of breaking text down into individual pieces called tokens. In NLTK, tokenization converts a string into a list of tokens, making it easier to process text word by word instead of character by character.
For example, given the input string:
Hi man, how have you been?
We should get the output:
['Hi', 'man', ',', 'how', 'have', 'you', 'been', '?']
Basic Word Tokenization
NLTK provides the word_tokenize() function to split text into words and punctuation marks:
from nltk.tokenize import word_tokenize

my_sent = "Hi man, how have you been?"
tokens = word_tokenize(my_sent)
print(tokens)
['Hi', 'man', ',', 'how', 'have', 'you', 'been', '?']
Sentence Tokenization
You can also tokenize text into sentences using sent_tokenize():
from nltk.tokenize import sent_tokenize
text = "Hi man, how have you been? I hope you are doing well. Let's meet soon!"
sentences = sent_tokenize(text)
for sentence in sentences:
print(sentence)
Hi man, how have you been?
I hope you are doing well.
Let's meet soon!
Filtering Punctuation
To get only words without punctuation, you can filter the tokens:
from nltk.tokenize import word_tokenize
import string

my_sent = "Hi man, how have you been?"
tokens = word_tokenize(my_sent)

# Filter out punctuation
words_only = [token for token in tokens if token not in string.punctuation]
print(words_only)
['Hi', 'man', 'how', 'have', 'you', 'been']
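An alternative is to skip the filtering step entirely and use NLTK's RegexpTokenizer, which only emits tokens that match a pattern, so punctuation never becomes a token in the first place. A minimal sketch:

```python
from nltk.tokenize import RegexpTokenizer

# \w+ matches runs of letters, digits, and underscores, so punctuation
# is simply skipped rather than filtered out afterwards.
tokenizer = RegexpTokenizer(r"\w+")
print(tokenizer.tokenize("Hi man, how have you been?"))
```

One trade-off to be aware of: a \w+ pattern splits contractions, so "Let's" becomes ['Let', 's'], whereas the word_tokenize() approach keeps punctuation-aware splits.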
Common Use Cases
| Function | Purpose | Output Type |
|---|---|---|
| word_tokenize() | Split into words and punctuation | List of tokens |
| sent_tokenize() | Split into sentences | List of sentences |
| Filter punctuation | Get only words | List of words |
Conclusion
NLTK's tokenization functions make it easy to break text into manageable pieces. Use word_tokenize() for word-level processing and sent_tokenize() for sentence-level analysis.
