PDFBox - Reading Text



In the previous chapter, we saw how to add text to an existing PDF document. In this chapter, we will discuss how to read text from an existing PDF document.

Extracting Text from an Existing PDF Document

Extracting text is one of the main features of the PDFBox library. You can extract text using the getText() method of the PDFTextStripper class, which returns all the text in the given PDF document.

Following are the steps to extract text from an existing PDF document.

Step 1: Loading an Existing PDF Document

Load an existing PDF document using the static loadPDF() method of the Loader class. This method accepts a RandomAccessReadBufferedFile object as a parameter. Because it is a static method, you can invoke it using the class name, as shown below.

// Loading an existing document 
PDDocument document = Loader.loadPDF(
   new RandomAccessReadBufferedFile("D:/Projects/PDFBox/PdfBox_Examples/sample.pdf"));

Step 2: Instantiating the PDFTextStripper Class

The PDFTextStripper class provides methods to retrieve text from a PDF document. Instantiate this class as shown below.

PDFTextStripper pdfStripper = new PDFTextStripper();

Step 3: Retrieving the Text

You can retrieve the text of the PDF document using the getText() method of the PDFTextStripper class. Pass the document object to this method as a parameter. The method extracts the text of the given document and returns it as a String object.

String text = pdfStripper.getText(document);
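
By default, getText() processes every page of the document. To limit extraction to a range of pages, PDFTextStripper provides the setStartPage() and setEndPage() methods, which take 1-based, inclusive page numbers. The following is a minimal sketch, assuming PDFBox 3.x. The class name PageRangeDemo and the helpers extractPages() and buildSample() are hypothetical names introduced for illustration. The sample document is built in memory only so that the sketch runs without an input file.

```java
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.pdmodel.font.Standard14Fonts;
import org.apache.pdfbox.text.PDFTextStripper;

public class PageRangeDemo {

   // Hypothetical helper: extracts text from pages first..last (1-based, inclusive)
   static String extractPages(PDDocument document, int first, int last) throws IOException {
      PDFTextStripper pdfStripper = new PDFTextStripper();
      pdfStripper.setStartPage(first);
      pdfStripper.setEndPage(last);
      return pdfStripper.getText(document);
   }

   // Hypothetical helper: builds an in-memory document with one line of text per page
   static PDDocument buildSample(int pages) throws IOException {
      PDDocument document = new PDDocument();
      for (int i = 1; i <= pages; i++) {
         PDPage page = new PDPage();
         document.addPage(page);
         try (PDPageContentStream content = new PDPageContentStream(document, page)) {
            content.beginText();
            content.setFont(new PDType1Font(Standard14Fonts.FontName.HELVETICA), 12);
            content.newLineAtOffset(50, 700);
            content.showText("Text on page " + i);
            content.endText();
         }
      }
      return document;
   }

   public static void main(String[] args) throws IOException {
      try (PDDocument document = buildSample(3)) {
         // Extract only the second page of the three-page sample
         System.out.println(extractPages(document, 2, 2));
      }
   }
}
```

Passing the same number as both the first and the last page extracts a single page.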

Step 4: Closing the Document

Finally, close the document using the close() method of the PDDocument class as shown below.

document.close();
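
An explicit close() call placed after getText() is skipped if getText() throws an exception. Because PDDocument implements Closeable, a try-with-resources statement can close the document automatically in either case. The following is a minimal sketch, assuming PDFBox 3.x. The class name SafeCloseDemo and the helpers readText() and writeSample() are hypothetical names introduced for illustration. The temporary file exists only so that the sketch runs without an input file.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.io.RandomAccessReadBufferedFile;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.pdmodel.font.Standard14Fonts;
import org.apache.pdfbox.text.PDFTextStripper;

public class SafeCloseDemo {

   // Hypothetical helper: loads the file, extracts its text, and always closes the document
   static String readText(File pdfFile) throws IOException {
      try (PDDocument document = Loader.loadPDF(new RandomAccessReadBufferedFile(pdfFile))) {
         PDFTextStripper pdfStripper = new PDFTextStripper();
         return pdfStripper.getText(document);
      } // document.close() is called here, even if getText() throws
   }

   // Hypothetical helper: writes a one-page PDF containing the given message
   static void writeSample(File target, String message) throws IOException {
      try (PDDocument document = new PDDocument()) {
         PDPage page = new PDPage();
         document.addPage(page);
         try (PDPageContentStream content = new PDPageContentStream(document, page)) {
            content.beginText();
            content.setFont(new PDType1Font(Standard14Fonts.FontName.HELVETICA), 12);
            content.newLineAtOffset(50, 700);
            content.showText(message);
            content.endText();
         }
         document.save(target);
      }
   }

   public static void main(String[] args) throws IOException {
      File sample = File.createTempFile("sample", ".pdf");
      try {
         writeSample(sample, "Hello from PDFBox");
         System.out.println(readText(sample));
      } finally {
         sample.delete();
      }
   }
}
```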

Example - Reading Text from a PDF Document

Suppose we have a PDF document with some text in it, as shown below.

Example PDF

This example demonstrates how to read text from the above-mentioned PDF document. Here, we create a Java program that loads a PDF document named new.pdf, which is saved in the path D:/Projects/PDFBox/PdfBox_Examples/. Save this code in a file named PDFBoxDemo.java.

PDFBoxDemo.java

package com.tutorialspoint.pdfbox;

import java.io.IOException;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.io.RandomAccessReadBufferedFile;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFBoxDemo {
   public static void main(String[] args) throws IOException {

      // Loading an existing document 
      PDDocument document = Loader.loadPDF(
         new RandomAccessReadBufferedFile("D:/Projects/PDFBox/PdfBox_Examples/new.pdf")); 

      // Instantiating the PDFTextStripper class
      PDFTextStripper pdfStripper = new PDFTextStripper();

      // Retrieving text from the PDF document
      String text = pdfStripper.getText(document);
      System.out.println(text);

      // Closing the document
      document.close();
   }
}

Output

Compile and run the code. It prints the text extracted from the PDF document:

This is an example of adding text to a page in the pdf document. we can add as many lines
as we want like this using the ShowText() method of the ContentStream class.