Apache Tika Tutorial

What is Apache Tika?

Apache Tika is a library for detecting document types and extracting content from a wide range of file formats. Internally, Tika relies on various existing document parsers and type detection techniques to detect and extract data. With Tika, one can build a universal type detector and content extractor that pulls both structured text and metadata from documents such as spreadsheets, text documents, images, PDFs and, to a certain extent, multimedia formats.
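To make the idea of type detection concrete, here is a minimal sketch of one common technique: checking a file's leading "magic" bytes against known signatures. It uses only the Java standard library and is not Tika's API or implementation; the class name MagicByteDetector and its three-entry signature list are assumptions made purely for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class; not part of Apache Tika.
public class MagicByteDetector {

    // Leading-byte signatures for a few well-known formats.
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G'};
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};

    /** Returns true when data begins with the given signature. */
    private static boolean startsWith(byte[] data, byte[] signature) {
        if (data.length < signature.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, signature.length), signature);
    }

    /** Guesses a media type from the first bytes, falling back to a generic type. */
    public static String detect(byte[] data) {
        if (startsWith(data, PDF)) return "application/pdf";
        if (startsWith(data, PNG)) return "image/png";
        if (startsWith(data, ZIP)) return "application/zip";
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(sample)); // prints "application/pdf"
    }
}
```

A real detector has to cover far more formats and typically combines several hints (such as file names and declared content types), which is the kind of work a library like Tika takes off the developer's hands.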

This tutorial is based on Apache Tika version 3.2.3, the latest release at the time of writing.

Who Should Learn Apache Tika?

This tutorial is intended for readers who want to understand and use Apache Tika's capabilities for document type detection and content extraction with the Java programming language. It covers the various ways of using Apache Tika and addresses common problems that developers and users face in Tika-based development.

Prerequisites to Learn Apache Tika

To get the most out of this tutorial, readers should have a basic understanding of Java programming. Familiarity with I/O operations and file handling will make the examples easier to follow. A basic understanding of the Eclipse IDE is also required, because all the examples were compiled using Eclipse.
