Apache Tika - Metadata Extraction



Besides content, Tika also extracts metadata from a file. Metadata is the additional information supplied with a file. For an audio file, for example, the artist name, album name, and title are all metadata.

XMP Standards

The Extensible Metadata Platform (XMP) is a standard for processing and storing information related to the content of a file. It was created by Adobe Systems Inc. XMP provides standards for defining, creating, and processing metadata. XMP metadata can be embedded in several file formats, such as PDF, JPEG, GIF, and HTML.
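To make the idea of embedded XMP more concrete, the sketch below reads the dc:creator entries from a small XMP packet using only the JDK's XML parser. XMP is serialized as RDF/XML. The sample packet, the class name XmpPacketSketch, and the creators() helper are assumptions made for illustration and are not part of Tika.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XmpPacketSketch {

   // Namespace URIs used for Dublin Core and RDF elements
   static final String DC_NS = "http://purl.org/dc/elements/1.1/";
   static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

   // Hand-written sample packet (an assumption, for illustration only)
   static final String SAMPLE_PACKET =
        "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
      + "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\">"
      + "<rdf:Description rdf:about=\"\" xmlns:dc=\"" + DC_NS + "\">"
      + "<dc:creator><rdf:Seq>"
      + "<rdf:li>ram</rdf:li><rdf:li>raheem</rdf:li>"
      + "</rdf:Seq></dc:creator>"
      + "</rdf:Description>"
      + "</rdf:RDF>"
      + "</x:xmpmeta>";

   // Returns the text of every rdf:li element found inside dc:creator
   static List<String> creators(String xmp) {
      try {
         DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
         factory.setNamespaceAware(true);
         Document doc = factory.newDocumentBuilder()
               .parse(new ByteArrayInputStream(xmp.getBytes(StandardCharsets.UTF_8)));

         List<String> result = new ArrayList<>();
         NodeList creatorNodes = doc.getElementsByTagNameNS(DC_NS, "creator");
         for (int i = 0; i < creatorNodes.getLength(); i++) {
            Element creator = (Element) creatorNodes.item(i);
            NodeList items = creator.getElementsByTagNameNS(RDF_NS, "li");
            for (int j = 0; j < items.getLength(); j++) {
               result.add(items.item(j).getTextContent().trim());
            }
         }
         return result;
      } catch (Exception e) {
         throw new IllegalArgumentException("Not a well-formed XMP packet", e);
      }
   }

   public static void main(String[] args) {
      System.out.println(creators(SAMPLE_PACKET));
   }
}
```

In practice Tika's parsers locate and read such packets for you; the sketch only shows what kind of data sits behind an XMP property.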

Property Class

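Tika uses the Property class to follow the XMP property definition. It provides the PropertyType and ValueType enums, which describe the kind of a property (for example, a simple value or a bag of values) and the type of its value (for example, text, integer, or date).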
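As a minimal sketch of how these enums surface in code, the class below defines two properties with Tika's Property factory methods and prints their kind and value type. The names demo:reviewer and demo:pageCount and the class PropertySketch are made up for this illustration; it assumes tika-core is on the classpath.

```java
import org.apache.tika.metadata.Property;

public class PropertySketch {

   // Hypothetical property names, used only in this sketch
   static final Property REVIEWER = Property.internalTextBag("demo:reviewer");
   static final Property PAGE_COUNT = Property.internalInteger("demo:pageCount");

   // Formats a property as "name -> PropertyType / ValueType"
   static String describe(Property property) {
      return property.getName() + " -> "
            + property.getPropertyType() + " / " + property.getValueType();
   }

   public static void main(String[] args) {
      System.out.println(describe(REVIEWER));
      System.out.println(describe(PAGE_COUNT));

      // A bag may hold several values, a simple property only one
      System.out.println(REVIEWER.isMultiValuePermitted());
      System.out.println(PAGE_COUNT.isMultiValuePermitted());
   }
}
```

The predefined constants on the Metadata class, such as those inherited from the TIFF interface, are Property objects of the same kind.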

Metadata Class

This class implements various interfaces such as ClimateForcast (the interface name is spelled this way in Tika), CreativeCommons, Geographic, and TIFF to provide support for various metadata models. In addition, it provides methods to read and modify the metadata extracted from a file.

Metadata Names

We can extract the list of all metadata names of a file from its metadata object using the method names(). It returns all the names as a string array. Using the name of the metadata, we can get the value using the get() method. It takes a metadata name and returns a value associated with it.

String[] metadataNames = metadata.names();

String value = metadata.get(name);
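The two calls above can be combined into a small helper. The sketch below copies every name and its value into a sorted map, which is handy because names() does not appear to promise any particular order. The class MetadataNamesSketch, its methods, and the hand-filled sample values are assumptions for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

import org.apache.tika.metadata.Metadata;

public class MetadataNamesSketch {

   // Hand-filled metadata standing in for the result of a real parse
   static Metadata sample() {
      Metadata metadata = new Metadata();
      metadata.set("Content-Type", "text/plain");
      metadata.set("Author", "Tutorials Point");
      return metadata;
   }

   // Copies each name and its (first) value into a map sorted by name
   static Map<String, String> toSortedMap(Metadata metadata) {
      Map<String, String> result = new TreeMap<>();
      for (String name : metadata.names()) {
         result.put(name, metadata.get(name));
      }
      return result;
   }

   public static void main(String[] args) {
      toSortedMap(sample())
            .forEach((name, value) -> System.out.println(name + ": " + value));
   }
}
```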

Extracting Metadata using Parse Method

Whenever we parse a file using parse(), we pass an empty metadata object as one of the parameters. This method extracts the metadata of the given file (if the file contains any) and places it in the metadata object. Therefore, after parsing the file using parse(), we can read the metadata from that object.

Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();   //empty metadata object 
FileInputStream inputstream = new FileInputStream(file);
ParseContext context = new ParseContext();
parser.parse(inputstream, handler, metadata, context);

// the metadata object now contains the extracted metadata of the given file
String[] metadataNames = metadata.names();
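The same steps also work on data held in memory. The sketch below wraps them in a helper that takes a byte array and closes the stream with try-with-resources. The class ParseInMemorySketch and its extract() method are assumptions for illustration, and which entries end up in the metadata object depends on the Tika parser modules available on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ParseInMemorySketch {

   // Parses the given bytes and returns the metadata filled in by parse()
   static Metadata extract(byte[] data) {
      Metadata metadata = new Metadata();   // empty metadata object
      try (InputStream stream = new ByteArrayInputStream(data)) {
         new AutoDetectParser().parse(
               stream, new BodyContentHandler(), metadata, new ParseContext());
      } catch (Exception e) {
         // IOException, SAXException and TikaException are wrapped for brevity
         throw new IllegalStateException("Parsing failed", e);
      }
      return metadata;
   }

   public static void main(String[] args) {
      byte[] data = "Hi students welcome to tutorialspoint"
            .getBytes(StandardCharsets.UTF_8);
      Metadata metadata = extract(data);
      for (String name : metadata.names()) {
         System.out.println(name + ": " + metadata.get(name));
      }
   }
}
```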

Given below is the complete program to extract metadata from a JPEG image file.

TikaDemo.java

package com.tutorialspoint.tika;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

import org.xml.sax.SAXException;

public class TikaDemo {
	
   public static void main(final String[] args) throws IOException, TikaException, SAXException {
	
      //Assume that boy.jpg is available
      File file = new File("D:/projects/boy.jpg");

      //Parser method parameters
      Parser parser = new AutoDetectParser();
      BodyContentHandler handler = new BodyContentHandler();
      Metadata metadata = new Metadata();
      ParseContext context = new ParseContext();

      //try-with-resources closes the stream once parsing is done
      try (FileInputStream inputstream = new FileInputStream(file)) {
         parser.parse(inputstream, handler, metadata, context);
      }
      System.out.println(handler.toString());

      //getting the list of all metadata elements
      String[] metadataNames = metadata.names();

      for(String name : metadataNames) {
         System.out.println(name + ": " + metadata.get(name));
      }
   }
}

Output

Given below is the snapshot of boy.jpg

[Image: boy.jpg]

If you execute the above program, it will give you the following output −

Resolution Units: inch
Number of Tables: 4 Huffman tables
File Modified Date: Mon Oct 27 11:58:38 +05:30 2025
Compression Type: Baseline
Data Precision: 8 bits
X-TIKA:Parsed-By-Full-Set: org.apache.tika.parser.DefaultParser
Number of Components: 3
tiff:ImageLength: 435
Component 2: Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert
Thumbnail Height Pixels: 0
Component 1: Y component: Quantization table 0, Sampling factors 2 horiz/2 vert
Image Height: 435 pixels
Thumbnail Width Pixels: 0
X Resolution: 96 dots
Image Width: 420 pixels
File Size: 40050 bytes
Component 3: Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert
Version: 1.1
X-TIKA:Parsed-By: org.apache.tika.parser.DefaultParser
File Name: apache-tika-17701004029025818984.tmp
tiff:BitsPerSample: 8
tiff:ImageWidth: 420
Content-Type: image/jpeg
Y Resolution: 96 dots

We can also retrieve a single metadata value by passing its name to the get() method, for example metadata.get("Content-Type").
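As a minimal sketch of picking out individual values, the class below reads the content type by name and the image width through the typed TIFF.IMAGE_WIDTH property. To stay self-contained it fills a Metadata object by hand with two of the values from the output above instead of parsing boy.jpg; the class SelectedValuesSketch and its methods are assumptions for illustration.

```java
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TIFF;

public class SelectedValuesSketch {

   // Hand-filled metadata standing in for the parsed boy.jpg
   static Metadata sample() {
      Metadata metadata = new Metadata();
      metadata.set(Metadata.CONTENT_TYPE, "image/jpeg");
      metadata.set(TIFF.IMAGE_WIDTH, 420);
      return metadata;
   }

   // Reads one value as a String and one as an Integer
   static String summary(Metadata metadata) {
      String type = metadata.get(Metadata.CONTENT_TYPE);
      Integer width = metadata.getInt(TIFF.IMAGE_WIDTH);   // null if absent
      return type + ", width=" + width;
   }

   public static void main(String[] args) {
      System.out.println(summary(sample()));
   }
}
```

Using a Property constant instead of a plain string lets getInt() return the value as a number, so the caller does not have to parse it.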

Adding New Metadata Values

We can add new metadata values using the add() method of the Metadata class. Given below is the syntax of this method. Here we are adding the author name.

metadata.add("Author", "Tutorials Point");

The Metadata class has predefined properties, including those inherited from interfaces like ClimateForcast, CreativeCommons, and Geographic, to support various data models. Shown below is the usage of the SOFTWARE property inherited from the TIFF interface, which Tika implements to follow the XMP metadata standard for TIFF image formats.

metadata.add(Metadata.SOFTWARE,"ms paint");

Given below is the complete program that demonstrates how to add metadata values to a given file. Here the list of the metadata elements is displayed in the output so that you can observe the change in the list after adding new values.

TikaDemo.java

package com.tutorialspoint.tika;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

import org.xml.sax.SAXException;

public class TikaDemo {

   public static void main(final String[] args) throws IOException, SAXException, TikaException {

      //create a file object and assume sample.txt is available
      File file = new File("D:/projects/sample.txt");

      //Parser method parameters
      Parser parser = new AutoDetectParser();
      BodyContentHandler handler = new BodyContentHandler();
      Metadata metadata = new Metadata();
      ParseContext context = new ParseContext();

      //parsing the document; try-with-resources closes the stream afterwards
      try (FileInputStream inputstream = new FileInputStream(file)) {
         parser.parse(inputstream, handler, metadata, context);
      }

      //list of meta data elements before adding new elements
      System.out.println("metadata elements :" + Arrays.toString(metadata.names()));

      //adding new meta data name value pair
      metadata.add("Author","Tutorials Point");
      System.out.println("metadata name value pair is successfully added");
      
      //printing all the meta data elements after adding new elements
      System.out.println("Here is the list of all the metadata elements after adding new elements");
      System.out.println( Arrays.toString(metadata.names()));
   }
}

Output

Given below is the content of sample.txt

Hi students welcome to tutorialspoint

If you execute the above program, it will give you the following output −

metadata elements :[X-TIKA:Parsed-By, X-TIKA:Parsed-By-Full-Set, Content-Encoding, X-TIKA:detectedEncoding, X-TIKA:encodingDetector, Content-Type]
metadata name value pair is successfully added
Here is the list of all the metadata elements after adding new elements
[X-TIKA:Parsed-By, X-TIKA:Parsed-By-Full-Set, Content-Encoding, Author, X-TIKA:detectedEncoding, X-TIKA:encodingDetector, Content-Type]

Setting Values to Existing Metadata Elements

You can set values for metadata elements using the add() and set() methods. The add() method appends a value to an element and creates the element if it does not exist yet. The syntax for adding a Date value using the add() method is as follows:

metadata.add("Date", new Date().toString());

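The set() method, in contrast, replaces any existing values of an element with the given one. To store several separate values under one name, call add() once per value and read them back with getValues(). The syntax below uses set() to store a comma-separated list of names in the Author property as a single string value: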

metadata.set("Author", "ram ,raheem ,robin ");
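The difference between add() and set() can be sketched on a Metadata object that is filled by hand, without parsing any file. The class MultiValueSketch and its withThreeAuthors() helper are assumptions for illustration.

```java
import java.util.Arrays;

import org.apache.tika.metadata.Metadata;

public class MultiValueSketch {

   // Each add() call appends one more value to the Author element
   static Metadata withThreeAuthors() {
      Metadata metadata = new Metadata();
      metadata.add("Author", "ram");
      metadata.add("Author", "raheem");
      metadata.add("Author", "robin");
      return metadata;
   }

   public static void main(String[] args) {
      Metadata metadata = withThreeAuthors();

      // The element now holds three separate values
      System.out.println(metadata.isMultiValued("Author"));
      System.out.println(Arrays.toString(metadata.getValues("Author")));

      // get() returns only the first value of a multi-valued element
      System.out.println(metadata.get("Author"));

      // set() discards the existing values and stores a single one
      metadata.set("Author", "Tutorials Point");
      System.out.println(Arrays.toString(metadata.getValues("Author")));
   }
}
```

Keeping the names as separate values avoids having to split a comma-separated string later.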

Given below is the complete program demonstrating the add() and set() methods.

TikaDemo.java

package com.tutorialspoint.tika;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import java.util.Date;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

import org.xml.sax.SAXException;

public class TikaDemo {

   public static void main(final String[] args) throws IOException,SAXException, TikaException {
   
      //Create a file object and assume sample.txt is available
      File file = new File("D:/projects/sample.txt");
      
      //parameters of parse() method
      Parser parser = new AutoDetectParser();
      BodyContentHandler handler = new BodyContentHandler();
      Metadata metadata = new Metadata();
      ParseContext context = new ParseContext();
      
      //Parsing the given file; try-with-resources closes the stream afterwards
      try (FileInputStream inputstream = new FileInputStream(file)) {
         parser.parse(inputstream, handler, metadata, context);
      }
     
      //list of metadata elements and their values
      System.out.println("metadata elements and values of the given file :");
      String[] metadataNamesb4 = metadata.names();
      
      for(String name : metadataNamesb4) {
         System.out.println(name + ": " + metadata.get(name));
      }
      
      //setting date meta data 
      metadata.add("Date", new Date().toString());
      
      //setting multiple values to author property
      metadata.set("Author", "ram ,raheem ,robin ");
      
      //printing all the meta data elements with new elements
      System.out.println("List of all the metadata elements  after adding new elements ");
      String[] metadataNamesafter = metadata.names();
      
      for(String name : metadataNamesafter) {
         System.out.println(name + ": " + metadata.get(name));
      }
   }
}

Output

Save the above code as TikaDemo.java and run it from the command prompt.

Given below is the content of sample.txt.

Hi students welcome to tutorialspoint

If you execute the above program it will give you the following output. In the output, you can observe the newly added metadata elements.

metadata elements and values of the given file :
X-TIKA:Parsed-By: org.apache.tika.parser.DefaultParser
X-TIKA:Parsed-By-Full-Set: org.apache.tika.parser.DefaultParser
Content-Encoding: ISO-8859-1
X-TIKA:detectedEncoding: ISO-8859-1
X-TIKA:encodingDetector: UniversalEncodingDetector
Content-Type: text/plain; charset=ISO-8859-1
List of all the metadata elements  after adding new elements 
X-TIKA:Parsed-By: org.apache.tika.parser.DefaultParser
X-TIKA:Parsed-By-Full-Set: org.apache.tika.parser.DefaultParser
Content-Encoding: ISO-8859-1
Author: ram ,raheem ,robin 
X-TIKA:detectedEncoding: ISO-8859-1
X-TIKA:encodingDetector: UniversalEncodingDetector
Date: Mon Oct 27 12:03:46 IST 2025
Content-Type: text/plain; charset=ISO-8859-1