Custom corpus in NLP


Introduction

A corpus is a collection of machine-readable text documents, usually organized in a particular structure so that various NLP operations can be performed on it. Corpus readers are utilities that read through these text files. With NLTK, a custom corpus is created inside the NLTK data directory, following the naming and layout convention that NLTK expects. The plural of corpus is corpora.

In this article, let us briefly look at corpora and at how to create a custom corpus.

Custom corpus

The text in a corpus can come from any of the following sources.

Original text that is already available electronically.
Voice/speech data transcribed into text.
OCR tools that extract text from documents.

WordNet and TreeBank are among the popular corpora in NLP.

Creating a custom corpus

We need the NLTK data module (nltk.data) to create a custom corpus. First we define a custom directory for the data, i.e. nltk_data. nltk.data.path is a list of directories that NLTK searches for data. For NLTK to find a custom corpus, our directory must be present in this list of paths, which can look like this:

'/root/nltk_data', '/usr/nltk_data', '/usr/share/nltk_data', '/usr/lib/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data'
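
If the directory you want to use is not among the recognized paths, one option is to add it at runtime. The following is a minimal sketch, assuming a hypothetical folder named my_nltk_data under the system temp directory. It relies only on the fact that nltk.data.path is an ordinary Python list.

```python
import os
import tempfile

import nltk.data

# Hypothetical data directory used only for illustration;
# any writable folder would work the same way.
custom_root = os.path.join(tempfile.gettempdir(), "my_nltk_data")
os.makedirs(custom_root, exist_ok=True)

# nltk.data.path is a plain list, so it can be extended at runtime.
if custom_root not in nltk.data.path:
    nltk.data.path.append(custom_root)

is_registered = custom_root in nltk.data.path
print("Registered with NLTK:", is_registered)
```

Setting the NLTK_DATA environment variable before starting Python is another commonly used way to add a directory to this list.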

Next, we create a file named custom_corpus.txt within the ~/nltk_data/corpus folder and then load it using NLTK. The NLTK loader supports several formats; this example uses two of them, raw and yaml.

## custom corpus NLP

!mkdir -p ~/nltk_data/corpus
!echo hello > ~/nltk_data/corpus/custom_corpus.txt
!echo data:134 > ~/nltk_data/corpus/custom_corpus.yaml


import os
import nltk.data

data_path = os.path.expanduser('~/nltk_data')

# Check whether the path exists; create it otherwise
if not os.path.exists(data_path):
    os.mkdir(data_path)

print("Is the required path present : ", os.path.exists(data_path))
print("Path exists within NLTK : ", data_path in nltk.data.path)

## load the raw text file (returned as bytes)
nltk.data.load("corpus/custom_corpus.txt", format="raw")

## load the yaml file (requires the PyYAML package)
nltk.data.load("corpus/custom_corpus.yaml", format="yaml")

Output

Is the required path present :  True
Path exists within NLTK :  True
{'data:134': None}
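
To make the effect of the format argument more concrete, the sketch below loads the same file once with format="raw" and once with format="text". It assumes a throwaway temporary directory in place of ~/nltk_data, and the variable names are illustrative, so it does not touch any real data.

```python
import os
import tempfile

import nltk.data

# Temporary stand-in for ~/nltk_data (an assumption of this sketch).
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "corpus"))

# Write bytes directly so the newline is the same on every platform.
with open(os.path.join(root, "corpus", "custom_corpus.txt"), "wb") as fh:
    fh.write(b"hello\n")

# Put the temporary root first so it is searched before other paths.
nltk.data.path.insert(0, root)

raw_value = nltk.data.load("corpus/custom_corpus.txt", format="raw")
text_value = nltk.data.load("corpus/custom_corpus.txt", format="text")

print(type(raw_value).__name__, type(text_value).__name__)
```

The raw format returns the file contents as bytes, while the text format returns a decoded string, which is usually more convenient for further NLP processing.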

Conclusion

NLTK provides tools to create a custom corpus for use in real-life applications. The NLTK loader can read the custom corpus, and NLTK's built-in functions can then be applied to it easily. A custom corpus is created using a defined structure and convention, following a set of rules. There are many corpus readers that can be used to read through a custom corpus.
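
As one illustration of a corpus reader, the sketch below points NLTK's PlaintextCorpusReader at a folder of .txt files. The temporary directory and the sample text are assumptions made for this example. Only the words() method is used, since it relies on the reader's default regular-expression tokenizer and needs no extra downloads.

```python
import os
import tempfile

from nltk.corpus.reader import PlaintextCorpusReader

# Temporary folder standing in for a real corpus directory (illustrative).
corpus_dir = tempfile.mkdtemp()
with open(os.path.join(corpus_dir, "custom_corpus.txt"), "w", encoding="utf-8") as fh:
    fh.write("hello world")

# The second argument is a regular expression selecting the corpus files.
reader = PlaintextCorpusReader(corpus_dir, r".*\.txt")

file_ids = reader.fileids()
words = list(reader.words("custom_corpus.txt"))

print(file_ids)
print(words)
```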

Mithilesh Pradhan

Updated on: 22-Sep-2023
