PoS Tagging and Lemmatization Using spaCy in Python


Python is a widely used language for learning and applying machine learning and deep learning. It offers numerous libraries and modules that provide a solid platform for building practical tools. In this article we will discuss one such library, known as “spaCy”.

spaCy is an open-source library used to analyse and process textual data. Before we dive into the topic, let’s quickly go through an overview of this article.

This article is divided into two sections −

  • In the first section we will understand the significance of spaCy and discuss the concepts of PoS tagging and lemmatization.

  • The second section will focus on the application of spaCy and the use of PoS tagging and lemmatization in code.

What is spaCy?

spaCy is an open-source library for Natural Language Processing (NLP), and it is often used alongside machine learning and deep learning. NLP is a field of artificial intelligence that supports human-computer interaction by giving machines a way to derive meaning from human language. With the help of spaCy we can process text at large scale and extract that meaning for the machine.

spaCy is written in Python and Cython, and it provides a simple API.

Installation

spaCy is installed with the help of “pip”.

pip install spacy

Once spaCy is installed we can import it in our IDE. We also need a trained pipeline package, which is downloaded separately (for example with “python -m spacy download en_core_web_sm”) and then loaded by name. For PoS tagging and lemmatization we will use −

en_core_web_sm

The name describes what kind of pipeline package we want. “en” is the language, “core” is the type (a general-purpose pipeline), “web” is the genre of text it was trained on and “sm” is the size.

So this name loads a small English pipeline, trained on written web text, whose capabilities include PoS tagging and lemmatization.
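The naming convention can be illustrated with a few lines of plain Python. This is only a sketch: the helper “describe_pipeline_name” is a hypothetical name introduced here and is not part of spaCy, and it assumes the name has exactly four parts separated by underscores.

```python
def describe_pipeline_name(name):
    """Split a pipeline package name into its four parts.

    Hypothetical helper for illustration; not a spaCy function.
    """
    lang, kind, genre, size = name.split("_")
    return {"language": lang, "type": kind, "genre": genre, "size": size}


parts = describe_pipeline_name("en_core_web_sm")
print(parts)
```

Reading the name this way makes it easier to guess what a differently named package might offer before downloading it.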

What is PoS tagging?

PoS (part-of-speech) tagging is the technique of assigning a grammatical category to each word in a text. It lets us analyse each word in its context, check a sentence grammatically and describe its structure.

A tagger also has to deal with unknown words, typically by inferring a tag from the surrounding context. With the tags in hand, we can check which words in the text are verbs, nouns, pronouns, prepositions and so on.

What is lemmatization?

Lemmatization is the technique of grouping together the different inflected forms of a word so that they can be treated as the same word. It is an integral tool of NLP and is used to normalize the inflected words found in a text.

It works by morphologically analysing the text and removing inflectional endings. The goal of lemmatization is to return the base form, or lemma, of each inflected word; for example, “operations” becomes “operation” and “is” becomes “be”.
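The idea of grouping inflected forms under one base word can be sketched without any library. The lookup table “TOY_LEMMAS” and both helper functions below are assumptions made for illustration; this is a toy and is not meant to describe how spaCy computes lemmas.

```python
from collections import defaultdict

# Hand-written lookup, for illustration only (an assumption, not spaCy data).
TOY_LEMMAS = {
    "is": "be",
    "are": "be",
    "used": "use",
    "made": "make",
    "operations": "operation",
}


def toy_lemma(word):
    """Return the base form from the toy table, or the word itself if unknown."""
    return TOY_LEMMAS.get(word.lower(), word)


def group_by_lemma(words):
    """Group words that share the same base form."""
    groups = defaultdict(list)
    for word in words:
        groups[toy_lemma(word)].append(word)
    return dict(groups)


groups = group_by_lemma(["is", "are", "used", "made", "operations"])
print(groups)
```

Here “is” and “are” end up in the same group because both map to the base word “be”.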

Example

We will construct a program to label the different parts of speech in a text using spaCy. First we will use PoS tagging and see how it functions −

In the code below,

  • We import spacy after installing it from the command prompt.

  • We create a variable named “load_capabilites” that holds the loaded pipeline, i.e., “en_core_web_sm”.

  • We store the text to analyse in “data_text”.

  • We call the pipeline on the text and store the result in a variable named “Anadata”.

  • Anadata holds every token from the text, already processed by spaCy.

  • We iterate over the tokens one at a time and print each token’s PoS tag with “word.pos_”.

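import spacy
# Load the small English pipeline (it must be downloaded beforehand)
load_capabilites = spacy.load("en_core_web_sm")
# The line break inside this string shows up as a SPACE token in the output
data_text = """Python programming can be used to perform numerous mathematical operations and provide solutions for different problems. Python is a very powerful language as it offers multiple modules
and methods that are tailor made to perform various operations"""
# Run the pipeline on the text; the result can be iterated token by token
Anadata = load_capabilites(data_text)
for word in Anadata:
    # pos_ is the coarse-grained part-of-speech tag of the token
    print(word, word.pos_)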

Output

Python PROPN
programming NOUN
can AUX
be AUX
used VERB
to PART
perform VERB
numerous ADJ
mathematical ADJ
operations NOUN
and CCONJ
provide VERB
solutions NOUN
for ADP
different ADJ
problems NOUN
. PUNCT
 SPACE
Python PROPN
is AUX
a DET
very ADV
powerful ADJ
language NOUN
as SCONJ
it PRON
offers VERB
multiple ADJ
modules NOUN
and CCONJ
methods NOUN
that PRON
are AUX
tailor AUX
made VERB
to PART
perform VERB
various ADJ
operations NOUN

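Here, each tag has a meaning; for example, “PROPN” means proper noun, “PUNCT” means punctuation and “ADJ” means adjective. The “SPACE” entry corresponds to the line break inside the text, and the exact tags may differ slightly between model versions.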
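Once every word has a tag, a quick way to summarize a text is to count how often each tag occurs. The sketch below uses a hard-coded list named “tagged” (an assumption for illustration) that copies the first ten word and tag pairs from the output above, so it runs without spaCy installed.

```python
from collections import Counter

# First ten (word, tag) pairs copied from the output above.
# With a real spaCy document the tags would come from token.pos_ instead.
tagged = [
    ("Python", "PROPN"), ("programming", "NOUN"), ("can", "AUX"),
    ("be", "AUX"), ("used", "VERB"), ("to", "PART"),
    ("perform", "VERB"), ("numerous", "ADJ"), ("mathematical", "ADJ"),
    ("operations", "NOUN"),
]

# Count how many words carry each tag.
tag_counts = Counter(tag for _, tag in tagged)

for tag, count in tag_counts.most_common():
    print(tag, count)
```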

Example

We can also pick out a single tag and print only the words that carry it.

import spacy
load_capabilites = spacy.load("en_core_web_sm")
data_text = """Python programming can be used to perform numerous mathematical operations and provide solutions for different problems. Python is a very powerful language as it offers multiple modules and methods that are tailor made to perform various operations"""
visdata = load_capabilites(data_text)
# Collect the text of every token tagged as an adjective
print("Adjectives:", [word.text for word in visdata if word.pos_ == "ADJ"])

Output

Adjectives: ['numerous', 'mathematical', 'different', 'powerful', 'multiple', 'various']
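The same filtering idea extends to all tags at once. The sketch below groups words by tag; “FakeToken” is a hypothetical stand-in with the same “text” and “pos_” attributes as a spaCy token, used only so the example runs without a model, and “group_words_by_tag” is likewise a name introduced here.

```python
from collections import defaultdict, namedtuple

# Stand-in for a spaCy token (an assumption, for illustration only).
FakeToken = namedtuple("FakeToken", ["text", "pos_"])


def group_words_by_tag(tokens):
    """Map each PoS tag to the list of words that carry it."""
    groups = defaultdict(list)
    for token in tokens:
        groups[token.pos_].append(token.text)
    return dict(groups)


# Word and tag pairs taken from the tagging output shown earlier.
sample = [
    FakeToken("Python", "PROPN"), FakeToken("is", "AUX"),
    FakeToken("a", "DET"), FakeToken("very", "ADV"),
    FakeToken("powerful", "ADJ"), FakeToken("language", "NOUN"),
]

by_tag = group_words_by_tag(sample)
print(by_tag["ADJ"])
```

Because the function only reads “text” and “pos_”, it should also accept a document produced by the spaCy pipeline in place of the stand-in list.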

Example

Now that we have understood how PoS tagging works, let’s see how lemmatization works.

import spacy
load_capabilites = spacy.load("en_core_web_sm")
data_text = """Python programming can be used to perform numerous mathematical operations and provide solutions for different problems. Python is a very powerful language as it offers multiple modules and methods that are tailor made to perform various operations"""
visdata = load_capabilites(data_text)
for word in visdata:
   print(word, word.lemma_)

Output

Python Python
programming programming
can can
be be
used use
to to
perform perform
numerous numerous
mathematical mathematical
operations operation
and and
provide provide
solutions solution
for for
different different
problems problem
. .
Python Python
is be
a a
very very
powerful powerful
language language
as as
it it
offers offer
multiple multiple
modules module
and and
methods method
that that
are be
tailor tailor
made make
to to
perform perform
various various
operations operation

Here, we used “lemma_” to perform lemmatization. Every inflected word is printed in its base form, and these base forms can then be collected into a vocabulary or dictionary for later processing.
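Collecting base forms into a vocabulary might look like the sketch below. The list “pairs” copies a few word and lemma pairs from the output above, and “build_vocabulary” is a hypothetical helper; with a real spaCy document the pairs would come from each token’s “text” and “lemma_”.

```python
def build_vocabulary(pairs):
    """Return the sorted, de-duplicated lemmas, skipping punctuation."""
    vocab = set()
    for _word, lemma in pairs:
        # isalpha() is False for entries such as "." so punctuation is dropped
        if lemma.isalpha():
            vocab.add(lemma.lower())
    return sorted(vocab)


# A few (word, lemma) pairs copied from the output above.
pairs = [
    ("used", "use"), ("operations", "operation"), ("solutions", "solution"),
    ("problems", "problem"), (".", "."), ("is", "be"), ("are", "be"),
    ("made", "make"), ("operations", "operation"),
]

vocabulary = build_vocabulary(pairs)
print(vocabulary)
```

Using a set means that “is” and “are” contribute a single entry, “be”, which is the point of lemmatizing before building the vocabulary.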

Conclusion

In this article we covered the basic concepts of PoS tagging and lemmatization and understood their significance in NLP. We also discussed how to apply them with the spaCy library.

Updated on: 27-Feb-2023
