Apache Tika - Language Detection

Need for Language Detection

To classify documents by the language they are written in, for example on a multilingual website, a language detection tool is needed. The tool should accept documents that carry no language annotation (metadata), detect the language, and add that information to the metadata of the document.

Algorithms for Profiling Corpus

What is Corpus?

To detect the language of a document, a language profile is constructed and compared with the profile of the known languages. The text set of these known languages is known as a corpus.

A corpus is a collection of texts of a written language that explains how the language is used in real situations.

The corpus is developed from books, transcripts, and other data resources like the Internet. The accuracy of the corpus depends upon the profiling algorithm we use to frame the corpus.

What are Profiling Algorithms?

The common way of detecting languages is by using dictionaries. The words used in a given piece of text will be matched with those that are in the dictionaries.

A list of the common words of a language is the simplest effective corpus for detecting that language, for example, the articles a, an, and the in English.

Using Word Sets as Corpus

Using word sets, a simple algorithm is framed to find the distance between two corpora, which will be equal to the sum of differences between the frequencies of matching words.
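
The distance described above can be sketched in a few lines of Java. This is only an illustration, not Tika code: the class name WordSetDistance and the sample sentences are made up for the example, and it assumes that a word missing from one profile counts with frequency 0, so that texts with no common words end up far apart.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of Apache Tika.
final class WordSetDistance {

    // Relative frequency of each word in the text (count / total words).
    static Map<String, Double> profile(String text) {
        Map<String, Double> freq = new HashMap<>();
        int total = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (word.isEmpty()) {
                continue;
            }
            freq.merge(word, 1.0, Double::sum);
            total++;
        }
        for (Map.Entry<String, Double> e : freq.entrySet()) {
            e.setValue(e.getValue() / total);
        }
        return freq;
    }

    // Sum of absolute frequency differences over all words of both profiles.
    // Assumption: a word absent from a profile has frequency 0 there.
    static double distance(Map<String, Double> a, Map<String, Double> b) {
        Set<String> words = new HashSet<>(a.keySet());
        words.addAll(b.keySet());
        double sum = 0.0;
        for (String w : words) {
            sum += Math.abs(a.getOrDefault(w, 0.0) - b.getOrDefault(w, 0.0));
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Double> english = profile("the cat sat on the mat and the dog sat on a rug");
        Map<String, Double> german = profile("der hund und die katze sitzen auf dem teppich und der matte");
        Map<String, Double> doc = profile("the dog and the cat");

        // The smaller distance marks the closer corpus.
        System.out.println("Distance to English sample : " + distance(english, doc));
        System.out.println("Distance to German sample  : " + distance(german, doc));
    }
}
```

The document is assigned to the corpus with the smallest distance. With such tiny samples the result is fragile, which is exactly the weakness of word sets on short texts.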

Such algorithms suffer from the following problems −

Since matching words occur infrequently, the algorithm does not work well on small texts of only a few sentences. It needs a lot of text for an accurate match.
It cannot detect word boundaries in languages that form compound words, or in those that have no word dividers such as spaces or punctuation marks.

Due to these difficulties in using word sets as corpus, individual characters or character groups are considered.

Using Character Sets as Corpus

Since the characters that are commonly used in a language are finite in number, it is easy to apply an algorithm based on character frequencies rather than word frequencies. This algorithm works even better when a character set is used by only one or very few languages.

This algorithm suffers from the following drawbacks −

It is difficult to differentiate two languages that have similar character frequencies.
No tool or algorithm can identify a specific language from the character set alone when that character set is used by multiple languages.
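
A minimal sketch of the character-set idea, using only the standard Java library, is to find the Unicode script that covers most letters of a text. The class name DominantScript is hypothetical and this is not how Tika works. It shows both the strength and the drawback listed above: a Thai or Greek script points to one language, while a Latin script is shared by many.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Apache Tika.
final class DominantScript {

    // Returns the Unicode script covering the most letters in the text,
    // or null when the text contains no letters at all.
    static Character.UnicodeScript of(String text) {
        Map<Character.UnicodeScript, Integer> counts =
                new EnumMap<>(Character.UnicodeScript.class);
        text.codePoints()
            .filter(Character::isLetter)
            .forEach(cp -> counts.merge(Character.UnicodeScript.of(cp), 1, Integer::sum));

        Character.UnicodeScript best = null;
        int bestCount = 0;
        for (Map.Entry<Character.UnicodeScript, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file ASCII-only.
        String thai = "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35";              // a Thai greeting
        String greek = "\u03ba\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1"; // a Greek greeting

        System.out.println(of(thai));          // script used by very few languages
        System.out.println(of(greek));         // script used by very few languages
        System.out.println(of("Hello world")); // Latin: shared by many languages
    }
}
```

A Latin result still leaves English, German, French, and many others as candidates, so the script alone cannot settle the language.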

N-gram Algorithm

The drawbacks stated above gave rise to a new approach: using character sequences of a given length for profiling a corpus. Such sequences of characters are called N-grams, where N represents the length of the character sequence.

N-gram algorithm is an effective approach for language detection, especially in case of European languages like English.
This algorithm works fine with short texts.
Although more advanced language profiling algorithms exist, including ones that can detect multiple languages in a multilingual document, Tika uses the 3-gram algorithm, as it is suitable in most practical situations.
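
To make the idea of a 3-gram profile concrete, here is a small sketch in plain Java. It is not Tika's implementation: the class name TrigramProfile is hypothetical, and cosine similarity is just one possible way, chosen here for brevity, to compare two profiles.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of Apache Tika.
final class TrigramProfile {

    // Counts every run of three consecutive characters in the text.
    // Non-letters collapse to one space, and the text is padded with a
    // space on both sides so that word starts and ends are captured too.
    static Map<String, Integer> of(String text) {
        String s = " " + text.toLowerCase(Locale.ROOT)
                             .replaceAll("[^\\p{L}]+", " ")
                             .trim() + " ";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity of two profiles: close to 1.0 for similar
    // trigram usage, 0.0 when no trigram is shared.
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        Map<String, Integer> english = of("the quick brown fox jumps over the lazy dog");
        Map<String, Integer> german = of("der schnelle braune fuchs springt weit und der hund schaut zu");
        Map<String, Integer> doc = of("the dog is over there");

        // The larger similarity marks the closer language profile.
        System.out.println("Similarity to English sample : " + similarity(english, doc));
        System.out.println("Similarity to German sample  : " + similarity(german, doc));
    }
}
```

For the word "the", the profile holds the three trigrams " th", "the", and "he ". Because even a short sentence produces many trigrams, this kind of profile carries more signal on short texts than a word list does.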

Language Detection in Tika

Of the 184 languages standardized by ISO 639-1, Tika can detect 18. The result of a detection is a LanguageResult object, and its getLanguage() method returns the ISO 639-1 code of the detected language as a String. Given below is the list of the 18 language-code pairs detected by Tika −

Code  Language     Code  Language     Code  Language     Code  Language
da    Danish       de    German       et    Estonian     el    Greek
en    English      es    Spanish      fi    Finnish      fr    French
hu    Hungarian    is    Icelandic    it    Italian      nl    Dutch
no    Norwegian    pl    Polish       pt    Portuguese   ru    Russian
sv    Swedish      th    Thai
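
Since the detector reports only the two-letter code, a program that wants to show a readable name has to translate it. One possible way, sketched below with the standard java.util.Locale class, needs no Tika call at all. The class name LanguageNames is hypothetical.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of Apache Tika.
final class LanguageNames {

    // Maps an ISO 639-1 code such as "da" to its English display name.
    static String englishName(String code) {
        return Locale.forLanguageTag(code).getDisplayLanguage(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        for (String code : new String[] {"da", "de", "en", "th"}) {
            System.out.println(code + " = " + englishName(code));
        }
    }
}
```

The same method can be applied to the String returned by getLanguage().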

To detect a language, instantiate a LanguageDetector implementation (detectors from various vendors are available), load its models, and call the detect() method on the text.

LanguageDetector detector = new OptimaizeLangDetector().loadModels();
LanguageResult result = detector.detect(text);

Given below is the example program for Language detection in Tika.

package com.tutorialspoint.tika;

import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.langdetect.optimaize.OptimaizeLangDetector;
import org.apache.tika.language.detect.LanguageDetector;
import org.apache.tika.language.detect.LanguageResult;
import org.xml.sax.SAXException;

public class TikaDemo {

   public static void main(String[] args) throws IOException, SAXException, TikaException {

      String text = "This is an example text in English.";

      //Loading the language models and detecting the language of the text
      LanguageDetector detector = new OptimaizeLangDetector().loadModels();
      LanguageResult result = detector.detect(text);

      //Printing the language code only when the detector is reasonably certain
      if (result != null && result.isReasonablyCertain()) {
         String language = result.getLanguage();
         System.out.println("Language of the given content is : " + language);
      } else {
         System.out.println("Language detection failed.");
      }
   }
}

Output

If you execute the above program, it gives the following output −

Language of the given content is : en

Language Detection of a Document

To detect the language of a given document, you first have to parse it using the parse() method. The parse() method parses the content and stores it in the handler object, which is passed to it as one of the arguments. Pass the String form of the handler object to the detect() method of the LanguageDetector class as shown below −

parser.parse(inputstream, handler, metadata, context);
LanguageResult result = detector.detect(handler.toString());

Given below is the complete program that demonstrates how to detect the language of a given document −

TikaDemo.java

package com.tutorialspoint.tika;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.langdetect.optimaize.OptimaizeLangDetector;
import org.apache.tika.language.detect.LanguageDetector;
import org.apache.tika.language.detect.LanguageResult;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class TikaDemo {

   public static void main(final String[] args) throws IOException, SAXException, TikaException {

      //Instantiating a file object
      File file = new File("D:/projects/sample.txt");

      //Parser method parameters
      Parser parser = new AutoDetectParser();
      BodyContentHandler handler = new BodyContentHandler();
      Metadata metadata = new Metadata();

      //Parsing the given document; the stream is closed automatically
      try (FileInputStream content = new FileInputStream(file)) {
         parser.parse(content, handler, metadata, new ParseContext());
      }

      //Detecting the language of the extracted text
      LanguageDetector detector = new OptimaizeLangDetector().loadModels();
      LanguageResult result = detector.detect(handler.toString());

      //Printing the language code only when the detector is reasonably certain
      if (result != null && result.isReasonablyCertain()) {
         String language = result.getLanguage();
         System.out.println("Language of the given content is : " + language);
      } else {
         System.out.println("Language detection failed.");
      }
   }
}

Output

Given below is the content of sample.txt.

Hi students welcome to tutorialspoint

If you execute the above program, it will give you the following output −

Language of the given content is : en

Along with the Tika jar, Tika provides a Graphical User Interface (GUI) application and a Command Line Interface (CLI) application. Like other Java applications, the Tika application can also be run from the command prompt.
