Apache POI Word - Text Extraction



This chapter explains how to extract simple text data from a Word document using Java. In case you want to extract metadata from a Word document, make use of Apache Tika.

For .docx files, we use the class org.apache.poi.xwpf.extractor.XPFFWordExtractor that extracts and returns simple data from a Word file. In the same way, we have different methodologies to extract headings, footnotes, table data, etc. from a Word file.

The following code shows how to extract simple text from a Word file −

import java.io.FileInputStream;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class WordExtractor {
   public static void main(String[] args)throws Exception {
      XWPFDocument docx = new XWPFDocument(new FileInputStream("createparagraph.docx"));
      
      //using XWPFWordExtractor Class
      XWPFWordExtractor we = new XWPFWordExtractor(docx);
      System.out.println(we.getText());
   }
}

Save the above code as WordExtractor.java. Compile and execute it from the command prompt as follows −

$javac WordExtractor.java
$java WordExtractor

It will generate the following output −

At tutorialspoint.com, we strive hard to provide quality tutorials for self-learning purpose
in the domains of Academics, Information Technology, Management and Computer Programming Languages.
Advertisements