PoS Tagging and Lemmatization Using spaCy in Python
Python is widely used for machine learning and deep learning, and its ecosystem includes many libraries for Natural Language Processing (NLP). In this article, we will discuss one of the most popular of these libraries: spaCy.
spaCy is an open-source library used to analyze and process textual data efficiently. We will explore two key NLP concepts: Part-of-Speech (PoS) tagging and lemmatization using spaCy.
What is spaCy?
spaCy is an industrial-strength NLP library designed for production use. It provides fast and accurate text processing capabilities including tokenization, PoS tagging, lemmatization, and named entity recognition. spaCy is written in Python and Cython, which makes it efficient for large-scale text processing tasks.
Installation and Setup
Install spaCy using pip:
pip install spacy
Download the English language model:
python -m spacy download en_core_web_sm
The model name en_core_web_sm follows a naming convention where:
en: English language
core: General-purpose capabilities
web: Trained on web text
sm: Small model size
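The naming convention above can be made concrete with a few lines of plain Python. This is only a sketch: `parse_model_name` and `ModelName` are hypothetical helpers written for this illustration, not part of spaCy, and they assume the four-part pattern described above.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    lang: str   # language code, e.g. "en"
    kind: str   # capability set, e.g. "core"
    genre: str  # kind of training text, e.g. "web"
    size: str   # model size, e.g. "sm"


def parse_model_name(name: str) -> ModelName:
    """Split a four-part model name such as 'en_core_web_sm'."""
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected four parts separated by '_', got {name!r}")
    return ModelName(*parts)


parsed = parse_model_name("en_core_web_sm")
print(parsed.lang, parsed.kind, parsed.genre, parsed.size)  # en core web sm
```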
What is PoS Tagging?
Part-of-Speech (PoS) tagging is the process of assigning grammatical categories (noun, verb, adjective, etc.) to each word in a text. This helps machines understand the syntactic role and meaning of words within their context.
Example: Basic PoS Tagging
import spacy
# Load the English model
nlp = spacy.load("en_core_web_sm")
# Sample text for analysis
text = "Python programming can be used to perform various operations."
# Process the text
doc = nlp(text)
# Display each word with its PoS tag
for token in doc:
    print(f"{token.text:12} {token.pos_}")
Python       PROPN
programming  NOUN
can          AUX
be           AUX
used         VERB
to           PART
perform      VERB
various      ADJ
operations   NOUN
.            PUNCT
Example: Filtering Specific PoS Tags
You can extract specific grammatical categories from text:
import spacy
nlp = spacy.load("en_core_web_sm")
text = "The quick brown fox jumps over the lazy dog."
doc = nlp(text)
# Extract adjectives
adjectives = [token.text for token in doc if token.pos_ == "ADJ"]
print("Adjectives:", adjectives)
# Extract nouns
nouns = [token.text for token in doc if token.pos_ == "NOUN"]
print("Nouns:", nouns)
Adjectives: ['quick', 'brown', 'lazy']
Nouns: ['fox', 'dog']
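Filtering by tag extends naturally to counting how often each tag occurs. The sketch below assumes only that each token exposes a `pos_` attribute, as spaCy tokens do; `pos_counts` and `main` are names chosen for this example. No expected output is shown because the exact tags depend on the model version.

```python
from collections import Counter


def pos_counts(doc) -> Counter:
    """Count coarse PoS tags, skipping punctuation and whitespace tokens."""
    return Counter(
        token.pos_ for token in doc if token.pos_ not in {"PUNCT", "SPACE"}
    )


def main() -> None:
    # Imported here so pos_counts can be reused without loading spaCy.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The quick brown fox jumps over the lazy dog.")
    # Print tags from most to least frequent.
    for tag, count in pos_counts(doc).most_common():
        print(f"{tag:6} {count}")
```

Calling `main()` after installing the model prints one line per tag with its frequency.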
What is Lemmatization?
Lemmatization is the process of reducing inflected words to their base or dictionary form (lemma). Unlike stemming, lemmatization considers the word's context and part of speech to produce meaningful base forms.
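To see why part of speech matters, compare a crude suffix-stripping stemmer with a lookup keyed by word and tag. Both functions below are toy illustrations written for this article: `naive_stem` and the `TOY_LEMMAS` table are hypothetical and do not reflect how spaCy or any real stemmer works internally.

```python
def naive_stem(word: str) -> str:
    """Strip a few common suffixes with no knowledge of grammar."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical hand-written lemma table keyed by (word, part of speech).
TOY_LEMMAS = {
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
    ("studies", "VERB"): "study",
    ("was", "AUX"): "be",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Look up the lemma of a word in a given grammatical role."""
    return TOY_LEMMAS.get((word, pos), word)


examples = [("meeting", "NOUN"), ("meeting", "VERB"), ("studies", "VERB"), ("was", "AUX")]
for word, pos in examples:
    print(f"{word:10} {pos:6} stem={naive_stem(word):8} lemma={toy_lemmatize(word, pos)}")
```

The stemmer turns "studies" into the non-word "studi", leaves "was" untouched, and cannot tell the noun "meeting" from the verb form, while the tag-aware lookup returns a dictionary form in each case.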
Example: Lemmatization Process
import spacy
nlp = spacy.load("en_core_web_sm")
text = "The cats are running and jumping in the gardens."
doc = nlp(text)
# Display original word and its lemma
print(f"{'Original':12} {'Lemma':12}")
print("-" * 24)
for token in doc:
    # Compare case-insensitively so that "The" -> "the" is not reported
    if token.text.lower() != token.lemma_:
        print(f"{token.text:12} {token.lemma_:12}")
Original     Lemma
------------------------
cats         cat
are          be
running      run
jumping      jump
gardens      garden
Combined PoS Tagging and Lemmatization
import spacy
nlp = spacy.load("en_core_web_sm")
text = "The students were studying advanced algorithms."
doc = nlp(text)
print(f"{'Word':12} {'PoS':8} {'Lemma':12}")
print("-" * 32)
for token in doc:
    print(f"{token.text:12} {token.pos_:8} {token.lemma_:12}")
Word         PoS      Lemma
--------------------------------
The          DET      the
students     NOUN     student
were         AUX      be
studying     VERB     study
advanced     ADJ      advanced
algorithms   NOUN     algorithm
.            PUNCT    .
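A common preprocessing step combines both attributes: keep only content words, chosen by PoS tag, and represent each by its lemma. The following is a minimal sketch of that idea; `content_lemmas`, `CONTENT_TAGS` and `main` are names chosen for this example, and the set of tags to keep is an assumption you may want to adjust.

```python
# Tags treated as "content words" in this sketch (an assumption).
CONTENT_TAGS = {"NOUN", "PROPN", "VERB", "ADJ"}


def content_lemmas(doc) -> list[str]:
    """Return lowercased lemmas of nouns, proper nouns, verbs and adjectives."""
    return [token.lemma_.lower() for token in doc if token.pos_ in CONTENT_TAGS]


def main() -> None:
    # Imported here so content_lemmas can be reused without loading spaCy.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The students were studying advanced algorithms.")
    print(content_lemmas(doc))
```

Given the tags and lemmas in the output above, calling `main()` should keep "student", "study", "advanced" and "algorithm" and drop the determiner, auxiliary and punctuation.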
Common PoS Tags
| Tag | Description | Example |
|---|---|---|
| NOUN | Noun | cat, car |
| VERB | Verb | run, eat |
| ADJ | Adjective | big, red |
| PROPN | Proper noun | Python, John |
| PUNCT | Punctuation | ., ! |
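The table covers only a few tags. The coarse-grained values in `token.pos_` follow the Universal Dependencies tag set, and spaCy's built-in `spacy.explain` function returns a short description for a given tag. The hand-written lookup below is a sketch of the same idea in plain Python; `UPOS_DESCRIPTIONS` and `describe` are names chosen for this example.

```python
# The 17 Universal Dependencies PoS tags with short descriptions.
# spaCy additionally uses SPACE for whitespace tokens.
UPOS_DESCRIPTIONS = {
    "ADJ": "adjective",
    "ADP": "adposition",
    "ADV": "adverb",
    "AUX": "auxiliary",
    "CCONJ": "coordinating conjunction",
    "DET": "determiner",
    "INTJ": "interjection",
    "NOUN": "noun",
    "NUM": "numeral",
    "PART": "particle",
    "PRON": "pronoun",
    "PROPN": "proper noun",
    "PUNCT": "punctuation",
    "SCONJ": "subordinating conjunction",
    "SYM": "symbol",
    "VERB": "verb",
    "X": "other",
}


def describe(tag: str) -> str:
    """Return a short description for a coarse PoS tag."""
    return UPOS_DESCRIPTIONS.get(tag.upper(), "unknown tag")


for tag in ("AUX", "PART", "DET"):
    print(f"{tag:6} {describe(tag)}")
```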
Conclusion
spaCy provides powerful tools for PoS tagging and lemmatization that are essential for text preprocessing in NLP applications. PoS tagging helps identify grammatical roles while lemmatization reduces words to their base forms for better text analysis and understanding.
