Apache Tika - Extracting MS Office Files



Example - Extracting Content and Metadata from an Excel Sheet

The following program extracts the content and metadata of a Microsoft Excel workbook (.xlsx) using Tika's OOXMLParser.

TikaDemo.java

package com.tutorialspoint.tika;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.sax.BodyContentHandler;

import org.xml.sax.SAXException;

public class TikaDemo {

   public static void main(final String[] args) throws IOException, TikaException, SAXException {

      // Set up the content handler, metadata holder, and parse context
      BodyContentHandler handler = new BodyContentHandler();
      Metadata metadata = new Metadata();
      ParseContext pcontext = new ParseContext();

      // Parse the workbook with the OOXML parser; try-with-resources closes the stream
      OOXMLParser msofficeparser = new OOXMLParser();
      try (FileInputStream inputstream = new FileInputStream(new File("D:/projects/example.xlsx"))) {
         msofficeparser.parse(inputstream, handler, metadata, pcontext);
      }

      // Print the extracted text
      System.out.println("Contents of the document:" + handler.toString());

      // Print every metadata field as "name: value"
      System.out.println("Metadata of the document:");
      String[] metadataNames = metadata.names();

      for (String name : metadataNames) {
         System.out.println(name + ": " + metadata.get(name));
      }
   }
}
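
OOXMLParser targets the ZIP-based Office Open XML formats such as .xlsx. Legacy binary files such as .xls are OLE2 compound documents, for which Tika ships a separate parser (org.apache.tika.parser.microsoft.OfficeParser). As a rough illustration of the difference, the sketch below inspects the leading bytes of a file using only the JDK. The class name OfficeContainerSniffer and its return values are hypothetical, and a ZIP signature alone does not prove that a file is a valid Office document.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not part of Tika: guesses the container type from magic bytes.
class OfficeContainerSniffer {

    // Leading bytes of a ZIP archive (OOXML packages such as .xlsx are ZIP files).
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // Leading bytes of an OLE2 compound file (legacy .xls, .doc, .ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Returns "zip", "ole2", or "unknown" for the given leading bytes.
    static String sniff(byte[] header) {
        if (startsWith(header, ZIP_MAGIC)) {
            return "zip";
        }
        if (startsWith(header, OLE2_MAGIC)) {
            return "ole2";
        }
        return "unknown";
    }

    // Reads up to the first 8 bytes of the file and classifies them (requires Java 11+).
    static String sniff(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return sniff(in.readNBytes(8));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

A result of "zip" suggests the file can be handed to OOXMLParser as in the program above, while "ole2" suggests a legacy format. In practice, Tika's AutoDetectParser performs this kind of detection itself and selects a parser automatically.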

Output

The program was run against the following sample Excel file.

[Figure: Passing Excel]

The sample file has the following properties:

[Figure: Excel Properties]

Running the program produces the following output.

Contents of the document:
Sheet1
	Name	Age	Designation	Salary
	Ramu	50	Manager	50000
	Raheem	40	Assistant Manager	40000
	Robert	30	Supervisor	30000
	Sita	25	Clerk	25000
	Sameer	25	Section Incharge	20000

Metadata of the document:
extended-properties:AppVersion: 16.0300
protected: false
extended-properties:Application: Microsoft Excel
meta:last-author: Mahesh Parashar
extended-properties:DocSecurityString: None
dc:creator: Mahesh Parashar
extended-properties:Company: 
dcterms:created: 2025-10-27T10:56:20Z
dcterms:modified: 2025-10-27T10:58:35Z
X-TIKA:origResourceName: D:\Projects\
Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
dc:publisher: 
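
The extracted sheet content in the output above is plain text in which each row begins with a tab and cells are separated by tabs. A minimal sketch of turning that text back into rows of cell values might look like the following. The class SheetTextRows and its method are hypothetical, and the leading-tab layout is assumed from this sample output rather than from any documented guarantee.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tika: splits extracted sheet text into rows.
class SheetTextRows {

    // Assumption: data rows start with a tab and cells are tab-separated, as in
    // the sample output. Lines without a leading tab (such as the sheet name
    // "Sheet1") and blank lines are skipped.
    static List<List<String>> parseRows(String text) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : text.split("\\R")) {
            if (!line.startsWith("\t")) {
                continue;
            }
            // Drop the leading tab; the -1 limit keeps trailing empty cells.
            String[] cells = line.substring(1).split("\t", -1);
            rows.add(new ArrayList<>(Arrays.asList(cells)));
        }
        return rows;
    }

    public static void main(String[] args) {
        String text = "Sheet1\n\tName\tAge\tDesignation\tSalary\n\tRamu\t50\tManager\t50000\n";
        for (List<String> row : parseRows(text)) {
            System.out.println(row);
        }
    }
}
```

In the program above, the input to parseRows would be handler.toString(). All values come back as strings, so numeric columns such as Age and Salary would still need to be converted by the caller.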