Apache POI Word - Text Extraction



This chapter explains how to extract simple text data from a Word document using Java. In case you want to extract metadata from a Word document, make use of Apache Tika.

For .docx files, we use the class org.apache.poi.xwpf.extractor.XWPFWordExtractor, which extracts and returns the plain text of a Word file. In the same way, POI provides other classes and methods to extract headings, footnotes, table data, etc. from a Word file.
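As a rough illustration of the "headings, table data" case, here is a minimal sketch that walks the paragraphs and tables of an XWPFDocument instead of using the extractor. The class name DocxStructureSketch and its helper methods are hypothetical, and treating any paragraph whose style ID starts with "Heading" as a heading is an assumption: it matches Word's default English style IDs (Heading1, Heading2, ...) but may not hold for every document. The sample document is built in memory so the sketch does not depend on a file.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.apache.poi.xwpf.usermodel.XWPFTableCell;
import org.apache.poi.xwpf.usermodel.XWPFTableRow;

public class DocxStructureSketch {

   // Collects the text of body paragraphs that look like headings.
   // Assumption: heading paragraphs carry a style ID starting with "Heading".
   static List<String> headings(XWPFDocument doc) {
      List<String> result = new ArrayList<>();
      for (XWPFParagraph p : doc.getParagraphs()) {
         String style = p.getStyle();   // may be null for unstyled paragraphs
         if (style != null && style.startsWith("Heading")) {
            result.add(p.getText());
         }
      }
      return result;
   }

   // Returns every table row in the document body as a list of cell texts.
   static List<List<String>> tableRows(XWPFDocument doc) {
      List<List<String>> rows = new ArrayList<>();
      for (XWPFTable table : doc.getTables()) {
         for (XWPFTableRow row : table.getRows()) {
            List<String> cells = new ArrayList<>();
            for (XWPFTableCell cell : row.getTableCells()) {
               cells.add(cell.getText());
            }
            rows.add(cells);
         }
      }
      return rows;
   }

   // Hypothetical helper: builds a small document in memory for the demo.
   static XWPFDocument sampleDocument() {
      XWPFDocument doc = new XWPFDocument();
      XWPFParagraph heading = doc.createParagraph();
      heading.setStyle("Heading1");
      heading.createRun().setText("Overview");
      doc.createParagraph().createRun().setText("Body text.");
      XWPFTable table = doc.createTable(1, 2);
      table.getRow(0).getCell(0).setText("Name");
      table.getRow(0).getCell(1).setText("Value");
      return doc;
   }

   public static void main(String[] args) throws Exception {
      try (XWPFDocument doc = sampleDocument()) {
         System.out.println("Headings: " + headings(doc));
         System.out.println("Table rows: " + tableRows(doc));
      }
   }
}
```

To run the same helpers on a real file, replace sampleDocument() with new XWPFDocument(new FileInputStream("example.docx")).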

The following code shows how to extract simple text from a Word file −

// create a document object from an existing Word document
XWPFDocument docx = new XWPFDocument(new FileInputStream("example.docx"));
      
// using XWPFWordExtractor Class
XWPFWordExtractor we = new XWPFWordExtractor(docx);
// extract the text
System.out.println(we.getText());

Example - Extracting Text from a Document

ApachePoiDocDemo.java

package com.tutorialspoint;

import java.io.FileInputStream;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class ApachePoiDocDemo {
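   public static void main(String[] args) throws Exception {
      // try-with-resources closes the stream, document and extractor,
      // even if reading or extraction throws an exception
      try (FileInputStream fis = new FileInputStream("example.docx");
           XWPFDocument docx = new XWPFDocument(fis);
           XWPFWordExtractor we = new XWPFWordExtractor(docx)) {

         // extract the text and print it to the console
         System.out.println(we.getText());
      }
   }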
}

Output

Assuming example.docx contains the two paragraphs shown below, the program generates the following output −

At tutorialspoint.com, we strive hard to provide quality tutorials for self-learning purpose in the domains of Academics, Information Technology, Management and Computer Programming Languages.
The endeavour started by Mohtashim, an AMU alumni, who is the founder and the managing director of Tutorials Point (I) Pvt. Ltd. He came up with the website tutorialspoint.com in year 2006 with the help of handpicked freelancers, with an array of tutorials for computer programming languages.
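
For a sense of where this text comes from: a .docx file is a ZIP package, and its body is stored in the entry word/document.xml, where paragraphs are w:p elements and visible text sits in w:t elements. The following is a minimal sketch that reads this entry with JDK classes only. It is not how XWPFWordExtractor is implemented, and it ignores headers, footers, footnotes and anything else the extractor may handle. The class name DocxRawTextSketch and the helper tinyDocx() are hypothetical; tinyDocx() builds a package containing just the one part this sketch reads, so Word or POI would not open it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DocxRawTextSketch {

   // Namespace of WordprocessingML elements such as w:p and w:t.
   static final String W_NS =
      "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

   // Returns the text of every w:p in word/document.xml, one paragraph per line.
   static String extractBodyText(InputStream docx) throws Exception {
      try (ZipInputStream zip = new ZipInputStream(docx)) {
         ZipEntry entry;
         while ((entry = zip.getNextEntry()) != null) {
            if (!entry.getName().equals("word/document.xml")) {
               continue;
            }
            // Copy the entry first so the XML parser never touches the ZIP stream.
            ByteArrayOutputStream xmlBytes = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = zip.read(buffer)) != -1) {
               xmlBytes.write(buffer, 0, n);
            }
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document xml = factory.newDocumentBuilder()
               .parse(new ByteArrayInputStream(xmlBytes.toByteArray()));

            StringBuilder text = new StringBuilder();
            NodeList paragraphs = xml.getElementsByTagNameNS(W_NS, "p");
            for (int i = 0; i < paragraphs.getLength(); i++) {
               Element paragraph = (Element) paragraphs.item(i);
               NodeList texts = paragraph.getElementsByTagNameNS(W_NS, "t");
               for (int j = 0; j < texts.getLength(); j++) {
                  text.append(texts.item(j).getTextContent());
               }
               text.append('\n');
            }
            return text.toString();
         }
      }
      throw new IllegalArgumentException("No word/document.xml entry found");
   }

   // Hypothetical helper: a ZIP with a two-paragraph word/document.xml only.
   static byte[] tinyDocx() throws Exception {
      String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
         + "<w:p><w:r><w:t xml:space=\"preserve\">Hello </w:t></w:r>"
         + "<w:r><w:t>World</w:t></w:r></w:p>"
         + "<w:p><w:r><w:t>Second</w:t></w:r></w:p>"
         + "</w:body></w:document>";
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      try (ZipOutputStream zip = new ZipOutputStream(out)) {
         zip.putNextEntry(new ZipEntry("word/document.xml"));
         zip.write(xml.getBytes(StandardCharsets.UTF_8));
         zip.closeEntry();
      }
      return out.toByteArray();
   }

   public static void main(String[] args) throws Exception {
      System.out.print(extractBodyText(new ByteArrayInputStream(tinyDocx())));
   }
}
```

In practice XWPFWordExtractor remains the simpler choice; the sketch only shows that the extracted text corresponds to content stored inside the package.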