Custom corpus in NLP
Introduction
A corpus is a collection of machine-readable text documents, usually organized in a particular structure. Various NLP operations can be performed on a corpus. Corpus readers are utilities that read through these text files. With NLTK, a custom corpus is created by placing files in a directory that follows the conventions of the NLTK data package. Corpora is the plural form of corpus.
In this article, let us briefly look at corpora and how to create a custom corpus.
Custom corpus
A corpus can be built from any of the following sources.
Original text that is already available electronically.
Voice/speech data transcribed into text.
OCR tools that extract text from scanned documents.
WordNet, TreeBank, etc. are some of the popular corpora in NLP.
Creating a custom corpus
We need the NLTK data module (nltk.data) to create a custom corpus. We define a custom directory for the data, i.e. nltk_data. nltk.data.path holds the list of directories that NLTK searches when it looks for data. To build a custom corpus, our directory must be present in this list of paths. For example:
'/root/nltk_data', '/usr/nltk_data', '/usr/share/nltk_data', '/usr/lib/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data'
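If the data lives in a directory that is not already on this list, it can be appended at runtime, since nltk.data.path is an ordinary Python list. The following is a minimal sketch under that assumption; the helper name register_data_dir and the temporary directory are hypothetical and exist only for illustration.

```python
import tempfile
from pathlib import Path

import nltk.data


def register_data_dir(directory):
    """Append directory to nltk.data.path if absent; return True if it was added."""
    directory = str(directory)
    if directory in nltk.data.path:
        return False
    nltk.data.path.append(directory)
    return True


# Hypothetical data directory, created only for this demonstration.
custom_dir = Path(tempfile.mkdtemp()) / "my_nltk_data"
custom_dir.mkdir()

added = register_data_dir(custom_dir)
print("Added to search path:", added)
print("Now searched by NLTK:", str(custom_dir) in nltk.data.path)
```

Appending places the directory at the end of the search order, so data found in an earlier directory still takes precedence.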
Next, we create a file named custom_corpus.txt within the ~/nltk_data/corpus folder and then load it using NLTK. Two of the loader formats that nltk.data.load supports are raw and yaml.
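The same steps can also be done without shell commands. The sketch below is one possible way, not the only one: it uses a temporary directory as a stand-in for ~/nltk_data (an assumption made so the real home directory is left untouched) and puts that directory first on nltk.data.path. It loads the file once with the raw format, which returns bytes, and once with the text format, which returns a decoded string.

```python
import tempfile
from pathlib import Path

import nltk.data

# Temporary stand-in for ~/nltk_data (an assumption for this demo).
data_root = Path(tempfile.mkdtemp())
corpus_dir = data_root / "corpus"
corpus_dir.mkdir()
(corpus_dir / "custom_corpus.txt").write_bytes(b"hello\n")

# Search the stand-in directory before any other data directory.
nltk.data.path.insert(0, str(data_root))

# "raw" returns the file contents as bytes.
raw_bytes = nltk.data.load("corpus/custom_corpus.txt", format="raw")
# "text" returns the contents decoded to a str (UTF-8 by default).
text = nltk.data.load("corpus/custom_corpus.txt", format="text")

print(type(raw_bytes).__name__, raw_bytes)
print(type(text).__name__, repr(text))
```

The resource name passed to nltk.data.load is relative to a data directory, which is why only corpus/custom_corpus.txt is given rather than the full path.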
## custom corpus NLP (notebook cell; lines starting with ! are shell commands)
!mkdir -p ~/nltk_data/corpus
!echo hello > ~/nltk_data/corpus/custom_corpus.txt
!echo "data: 134" > ~/nltk_data/corpus/custom_corpus.yaml

import os
import nltk.data

data_path = os.path.expanduser('~/nltk_data')
# check whether the path exists, otherwise create it
if not os.path.exists(data_path):
    os.mkdir(data_path)
print("Is the required path present :", os.path.exists(data_path))
print("Path exists within NLTK :", data_path in nltk.data.path)

## load the raw text file (returned as bytes)
print(nltk.data.load("corpus/custom_corpus.txt", format="raw"))
## load the yaml file (parsed into a Python object; requires PyYAML)
print(nltk.data.load("corpus/custom_corpus.yaml", format="yaml"))
Output
Is the required path present : True
Path exists within NLTK : True
b'hello\n'
{'data': 134}
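Besides nltk.data.load, a corpus reader can be pointed at a directory of text files. The sketch below assumes NLTK's PlaintextCorpusReader class and a temporary directory with two made-up files (intro.txt and notes.txt); the file names and contents are hypothetical. It only uses fileids() and words(), which rely on the default regular-expression word tokenizer and should not need any downloaded NLTK data.

```python
import tempfile
from pathlib import Path

from nltk.corpus.reader import PlaintextCorpusReader

# Hypothetical corpus directory with two small text files.
corpus_root = Path(tempfile.mkdtemp())
(corpus_root / "intro.txt").write_text("Hello corpus!\n", encoding="utf-8")
(corpus_root / "notes.txt").write_text("A custom corpus is small.\n", encoding="utf-8")

# The second argument is a regular expression selecting the files to include.
reader = PlaintextCorpusReader(str(corpus_root), r".*\.txt")

file_ids = reader.fileids()
intro_words = list(reader.words("intro.txt"))
all_words = list(reader.words())

print(file_ids)
print(intro_words)
print(len(all_words))
```

Here words() splits punctuation into separate tokens, so the count includes the exclamation mark and the full stop.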
Conclusion
NLTK provides tools to create a custom corpus for use in real-life applications. The NLTK loader can read through the custom corpus, and inbuilt functions can be applied to it easily. A custom corpus is created using a defined structure and convention, following a set of rules. There are many corpus readers that can be used to read through a custom corpus.