OpenNLP - Referenced API



In this chapter, we discuss the classes and methods that we will use in the subsequent chapters of this tutorial.

Sentence Detection

SentenceModel class

This class represents the predefined model which is used to detect the sentences in the given raw text. This class belongs to the package opennlp.tools.sentdetect.

The constructor of this class accepts an InputStream object of the sentence detector model file (en-sent.bin).

SentenceDetectorME class

This class belongs to the package opennlp.tools.sentdetect and contains methods to split raw text into sentences. It uses a maximum entropy model to evaluate end-of-sentence characters in a string and determine whether they signify the end of a sentence.

Following are the important methods of this class.

S.No  Methods and Description

1     sentDetect()
      This method is used to detect the sentences in the raw text passed to it. It accepts a String variable as a parameter and returns a String array which holds the sentences from the given raw text.

2     sentPosDetect()
      This method is used to detect the positions of the sentences in the given text. It accepts a String variable representing the raw text and returns an array of objects of the type Span.
      The Span class of the opennlp.tools.util package stores the start and end offsets of a span (here, the character range of each sentence).

3     getSentenceProbabilities()
      This method returns the probabilities associated with the most recent call to the sentDetect() method.
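As a minimal sketch of how these pieces fit together, the following Java program loads the sentence model, splits a short text into sentences, and prints the spans and probabilities. The class name SentenceDetectionSketch, the helper sentencesFromSpans(), the sample text, and the assumption that en-sent.bin sits in the working directory are illustrative choices, not part of OpenNLP. The calls follow the 1.x API; later releases may deprecate some of them.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.util.Span;

public class SentenceDetectionSketch {

    // Hypothetical helper: recovers the sentence strings from character spans
    // such as those returned by sentPosDetect().
    static String[] sentencesFromSpans(String text, Span[] spans) {
        return Span.spansToStrings(spans, text);
    }

    public static void main(String[] args) throws IOException {
        String text = "Hi. How are you? Welcome to the tutorial.";

        // Assumption: en-sent.bin has been downloaded into the working directory.
        try (InputStream modelIn = Files.newInputStream(Paths.get("en-sent.bin"))) {
            SentenceModel model = new SentenceModel(modelIn);
            SentenceDetectorME detector = new SentenceDetectorME(model);

            // Sentences as strings.
            for (String sentence : detector.sentDetect(text)) {
                System.out.println(sentence);
            }
            // Probabilities for the sentDetect() call above.
            System.out.println(Arrays.toString(detector.getSentenceProbabilities()));

            // Sentences as character offsets into the original text.
            Span[] spans = detector.sentPosDetect(text);
            for (Span span : spans) {
                System.out.println(span.getStart() + ".." + span.getEnd());
            }
            System.out.println(Arrays.toString(sentencesFromSpans(text, spans)));
        }
    }
}
```

The spans index characters of the original text, with the end offset exclusive, so converting them back with Span.spansToStrings() should yield the same sentences as sentDetect().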

Tokenization

TokenizerModel class

This class represents the predefined model which is used to tokenize the given sentence. This class belongs to the package opennlp.tools.tokenize.

The constructor of this class accepts an InputStream object of the tokenizer model file (en-token.bin).

Classes

To perform tokenization, the OpenNLP library provides three main classes. All three classes implement the Tokenizer interface.

S.No  Classes and Description

1     SimpleTokenizer
      This class tokenizes the given raw text using character classes.

2     WhitespaceTokenizer
      This class uses whitespace to tokenize the given text.

3     TokenizerME
      This class converts raw text into separate tokens. It uses Maximum Entropy to make its decisions.

These classes contain the following methods.

S.No  Methods and Description

1     tokenize()
      This method is used to tokenize the raw text. It accepts a String variable as a parameter and returns an array of Strings (tokens).

2     tokenizePos()
      This method is used to get the positions or spans of the tokens. It accepts the sentence (or raw text) in the form of a String and returns an array of objects of the type Span.

In addition to the above two methods, the TokenizerME class has the getTokenProbabilities() method.

S.No  Methods and Description

1     getTokenProbabilities()
      This method is used to get the probabilities associated with the most recent calls to the tokenizePos() method.
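The two model-free tokenizers can be compared with a short sketch like the one below. It assumes the shared INSTANCE fields that SimpleTokenizer and WhitespaceTokenizer expose in the opennlp.tools.tokenize package; the class name TokenizerSketch and its two helper methods are illustrative only. TokenizerME is left out because it additionally needs a TokenizerModel loaded from a model file.

```java
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.Span;

public class TokenizerSketch {

    // Splits wherever the character class changes (letters, digits, other).
    static String[] simpleTokens(String text) {
        return SimpleTokenizer.INSTANCE.tokenize(text);
    }

    // Splits on whitespace only, so punctuation stays attached to words.
    static String[] whitespaceTokens(String text) {
        return WhitespaceTokenizer.INSTANCE.tokenize(text);
    }

    public static void main(String[] args) {
        String sentence = "Hi. How are you?";

        // Hi | . | How | are | you | ?
        System.out.println(String.join(" | ", simpleTokens(sentence)));
        // Hi. | How | are | you?
        System.out.println(String.join(" | ", whitespaceTokens(sentence)));

        // tokenizePos() reports character offsets instead of strings.
        Tokenizer tokenizer = WhitespaceTokenizer.INSTANCE;
        for (Span span : tokenizer.tokenizePos(sentence)) {
            System.out.println(span.getStart() + ".." + span.getEnd()
                    + " " + span.getCoveredText(sentence));
        }
    }
}
```

Because all three tokenizers implement the Tokenizer interface, code written against that interface can switch between them without other changes.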

Named Entity Recognition

TokenNameFinderModel class

This class represents the predefined model which is used to find the named entities in the given sentence. This class belongs to the package opennlp.tools.namefind.

The constructor of this class accepts an InputStream object of the name finder model file (en-ner-person.bin).

NameFinderME class

The class belongs to the package opennlp.tools.namefind and it contains methods to perform the NER tasks. This class uses a maximum entropy model to find the named entities in the given raw text.

S.No  Methods and Description

1     find()
      This method is used to detect the names in a tokenized sentence. It accepts a String array holding the tokens of the sentence as a parameter and returns an array of objects of the type Span, whose offsets are token indices.

2     probs()
      This method is used to get the probabilities of the last decoded sequence.
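A minimal sketch of the name finder in use might look as follows. It assumes en-ner-person.bin is in the working directory and that the sentence has already been tokenized; the class name NameFinderSketch, the helper namesFromSpans(), and the sample tokens are illustrative only.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class NameFinderSketch {

    // Hypothetical helper: joins the tokens covered by each span into a name.
    // The span offsets index the token array, not the characters.
    static String[] namesFromSpans(Span[] spans, String[] tokens) {
        return Span.spansToStrings(spans, tokens);
    }

    public static void main(String[] args) throws IOException {
        String[] tokens = {"Mike", "Smith", "is", "a", "good", "person"};

        // Assumption: en-ner-person.bin has been downloaded into the working directory.
        try (InputStream modelIn = Files.newInputStream(Paths.get("en-ner-person.bin"))) {
            NameFinderME finder = new NameFinderME(new TokenNameFinderModel(modelIn));

            Span[] spans = finder.find(tokens);
            String[] names = namesFromSpans(spans, tokens);
            for (int i = 0; i < spans.length; i++) {
                System.out.println(spans[i] + " -> " + names[i]);
            }

            // Probabilities of the sequence decoded by the find() call above.
            System.out.println(Arrays.toString(finder.probs()));
        }
    }
}
```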

Finding the Parts of Speech

POSModel class

This class represents the predefined model which is used to tag the parts of speech of the given sentence. This class belongs to the package opennlp.tools.postag.

The constructor of this class accepts an InputStream object of the POS tagger model file (en-pos-maxent.bin).

POSTaggerME class

This class belongs to the package opennlp.tools.postag and it is used to predict the parts of speech of the given raw text. It uses Maximum Entropy to make its decisions.

S.No  Methods and Description

1     tag()
      This method is used to assign POS tags to the tokens of a sentence. It accepts an array of tokens (String) as a parameter and returns an array of tags.

2     probs()
      This method is used to get the probabilities for each tag of the most recently tagged sentence.
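The tagging workflow can be sketched as below, assuming en-pos-maxent.bin is in the working directory. The class name PosTaggerSketch, the helper pair(), and the sample sentence are illustrative only. The per-tag probabilities are read here through probs(), the accessor that POSTaggerME provides for the most recently tagged sentence.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class PosTaggerSketch {

    // Hypothetical helper: renders each token with its tag as token_TAG.
    static String pair(String[] tokens, String[] tags) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(tokens[i]).append('_').append(tags[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // tag() works on tokens, so the sentence is tokenized first.
        String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize("Hi welcome to the tutorial");

        // Assumption: en-pos-maxent.bin has been downloaded into the working directory.
        try (InputStream modelIn = Files.newInputStream(Paths.get("en-pos-maxent.bin"))) {
            POSTaggerME tagger = new POSTaggerME(new POSModel(modelIn));

            String[] tags = tagger.tag(tokens);
            System.out.println(pair(tokens, tags));

            // One probability per tag assigned by the tag() call above.
            System.out.println(Arrays.toString(tagger.probs()));
        }
    }
}
```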

Parsing the Sentence

ParserModel class

This class represents the predefined model which is used to parse the given sentence. This class belongs to the package opennlp.tools.parser.

The constructor of this class accepts an InputStream object of the parser model file (en-parser-chunking.bin).

ParserFactory class

This class belongs to the package opennlp.tools.parser and it is used to create parsers.

S.No  Methods and Description

1     create()
      This is a static method used to create a parser object. It accepts a ParserModel object loaded from the parser model file.

ParserTool class

This class belongs to the opennlp.tools.cmdline.parser package and is used to parse the content.

S.No  Methods and Description

1     parseLine()
      This static method of the ParserTool class is used to parse the raw text in OpenNLP. It accepts:
        • A String variable representing the text to be parsed.
        • A parser object.
        • An integer representing the number of parses to be carried out.
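The three parser classes can be combined as in the following sketch, which loads a ParserModel, builds a parser with ParserFactory.create(), and asks ParserTool.parseLine() for a single parse. The class name ParserSketch, the helper parseToString(), the sample sentence, and the location of en-parser-chunking.bin are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import opennlp.tools.cmdline.parser.ParserTool;
import opennlp.tools.parser.Parse;
import opennlp.tools.parser.Parser;
import opennlp.tools.parser.ParserFactory;
import opennlp.tools.parser.ParserModel;

public class ParserSketch {

    // Hypothetical helper: parses one sentence and returns the best parse
    // in bracketed form, or an empty string if no parse was produced.
    static String parseToString(Path modelFile, String sentence) throws IOException {
        try (InputStream modelIn = Files.newInputStream(modelFile)) {
            ParserModel model = new ParserModel(modelIn);
            Parser parser = ParserFactory.create(model);

            // Request a single parse for the sentence.
            Parse[] parses = ParserTool.parseLine(sentence, parser, 1);
            if (parses.length == 0) {
                return "";
            }
            StringBuffer sb = new StringBuffer();
            parses[0].show(sb);
            return sb.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: en-parser-chunking.bin has been downloaded into the working directory.
        Path modelFile = Paths.get("en-parser-chunking.bin");
        System.out.println(parseToString(modelFile, "Tutorialspoint is the largest tutorial library."));
    }
}
```

Loading the parser model is comparatively slow, so a real application would likely create the parser once and reuse it across sentences rather than reloading it per call as this sketch does.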

Chunking

ChunkerModel class

This class represents the predefined model which is used to divide a sentence into smaller chunks. This class belongs to the package opennlp.tools.chunker.

The constructor of this class accepts an InputStream object of the chunker model file (en-chunker.bin).

ChunkerME class

This class belongs to the package named opennlp.tools.chunker and it is used to divide the given sentence into smaller chunks.

S.No  Methods and Description

1     chunk()
      This method is used to divide the given sentence into smaller chunks. It accepts the tokens of a sentence and their part-of-speech tags as parameters, and returns an array of chunk tags, one per token.

2     probs()
      This method returns the probabilities of the last decoded sequence.
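A minimal chunking sketch is shown below. To keep it short, the POS tags are written by hand rather than produced by a tagger; in practice they would come from POSTaggerME. The class name ChunkerSketch, the helper label(), the sample tokens and tags, and the location of en-chunker.bin are assumptions for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;

public class ChunkerSketch {

    // Hypothetical helper: renders each token with its chunk tag as token[TAG].
    static String label(String[] tokens, String[] chunkTags) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(tokens[i]).append('[').append(chunkTags[i]).append(']');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String[] tokens = {"The", "dog", "barks"};
        // Assumption: hand-written POS tags standing in for tagger output.
        String[] posTags = {"DT", "NN", "VBZ"};

        // Assumption: en-chunker.bin has been downloaded into the working directory.
        try (InputStream modelIn = Files.newInputStream(Paths.get("en-chunker.bin"))) {
            ChunkerME chunker = new ChunkerME(new ChunkerModel(modelIn));

            // One chunk tag per token.
            String[] chunkTags = chunker.chunk(tokens, posTags);
            System.out.println(label(tokens, chunkTags));

            // Probabilities of the sequence decoded by the chunk() call above.
            System.out.println(Arrays.toString(chunker.probs()));
        }
    }
}
```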
