Using Tesseract OCR with Java, with Examples


Introduction

Optical Character Recognition (OCR) converts printed text in images into machine-readable text, so that it can be edited, searched, and stored more compactly. One of the most widely used open-source OCR engines is Tesseract. This article shows how to use Tesseract OCR from Java, with examples for each step.

What is Tesseract OCR?

Tesseract is an open-source OCR engine, originally developed at Hewlett-Packard and later sponsored by Google, with trained language data available for more than 100 languages. It is widely regarded for its accuracy and adaptability, which makes it a popular choice for developers across many kinds of applications.

Integrating Tesseract OCR with Java

To integrate Tesseract OCR with Java, we can use Tess4J, a Java wrapper for the Tesseract OCR API built on JNA (Java Native Access). Tess4J bridges the gap between the native Tesseract engine and Java applications.

Step 1: Setting Up the Environment

First, we need to install Tesseract OCR and add Tess4J to the project. Tesseract can be installed on Windows, Linux, and macOS, typically through an installer or the platform's package manager. To include Tess4J in your Java project, add it as a Maven dependency:

<dependency>
   <groupId>net.sourceforge.tess4j</groupId>
   <artifactId>tess4j</artifactId>
   <version>4.5.4</version> <!-- or whatever the latest version is -->
</dependency>

Step 2: Performing OCR on an Image

Below is a simple Java program that performs OCR on an image file:

import java.io.File;

import net.sourceforge.tess4j.*;

public class OCRExample {
   public static void main(String[] args) {
      File imageFile = new File("path_to_your_image_file");
      ITesseract instance = new Tesseract();  // JNA Interface Mapping
      instance.setDatapath("path_to_tessdata"); // replace with your tessdata path

      try {
         // Run recognition and print the extracted text
         String result = instance.doOCR(imageFile);
         System.out.println(result);
      } catch (TesseractException e) {
         System.err.println(e.getMessage());
      }
   }
}

In this example, we instantiate a Tesseract object and set the path to the tessdata directory, which contains language data files. We then call doOCR() on our image file, which returns a String containing the recognized text.
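Because doOCR() relies on the language files in the tessdata directory, a missing file can cause recognition to fail. As a minimal sketch, assuming Tesseract's usual `<language>.traineddata` file naming, a check like the following could be run before calling doOCR(). The class name TessdataCheck and the method hasLanguageData are hypothetical helpers, not part of Tess4J, and the temporary directory only stands in for a real tessdata folder:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TessdataCheck {

    /**
     * Hypothetical helper (not part of Tess4J): returns true if dataPath
     * contains a file named "<language>.traineddata".
     */
    public static boolean hasLanguageData(Path dataPath, String language) {
        return Files.isRegularFile(dataPath.resolve(language + ".traineddata"));
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory stands in for a real tessdata folder here.
        Path tessdata = Files.createTempDirectory("tessdata");
        Files.createFile(tessdata.resolve("eng.traineddata"));

        System.out.println(hasLanguageData(tessdata, "eng")); // prints "true"
        System.out.println(hasLanguageData(tessdata, "fra")); // prints "false"
    }
}
```

In a real program, the same path would then be passed to instance.setDatapath(), and OCR would only be attempted if the check succeeds.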

Step 3: Handling Multiple Languages

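English is the default language. To perform OCR in a different language, make sure the matching language file (for example, fra.traineddata for French) is present in the tessdata directory, then set the language on the Tesseract instance: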

instance.setLanguage("fra"); // for French

Then, call doOCR() as usual:

try {
   String result = instance.doOCR(imageFile);
   System.out.println(result);
} catch (TesseractException e) {
   System.err.println(e.getMessage());
}

This will now perform OCR on the image using French language data.
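Tesseract also accepts several languages at once, written as language codes joined with "+", such as "eng+fra". The sketch below builds such a string from individual codes. The class name LanguageSpec and the method combine are hypothetical helpers for illustration, not part of Tess4J:

```java
public class LanguageSpec {

    /**
     * Hypothetical helper (not part of Tess4J): joins language codes
     * with "+", the separator Tesseract uses for multiple languages.
     */
    public static String combine(String... codes) {
        if (codes.length == 0) {
            throw new IllegalArgumentException("at least one language code is required");
        }
        for (String code : codes) {
            if (code == null || code.trim().isEmpty()) {
                throw new IllegalArgumentException("language codes must not be empty");
            }
        }
        return String.join("+", codes);
    }

    public static void main(String[] args) {
        System.out.println(combine("eng", "fra")); // prints "eng+fra"
    }
}
```

The resulting string can be passed to instance.setLanguage(), provided that a traineddata file for each listed language is available in the tessdata directory.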

Conclusion

Tesseract OCR, used from Java through Tess4J, gives developers a practical way to add OCR capabilities to their applications. Its flexibility, accuracy, and extensive language support make it a good choice for a broad range of OCR tasks.

Updated on: 16-Jun-2023
