PDFBox - Overview
The Portable Document Format (PDF) is a file format for presenting documents in a manner that is independent of application software, hardware, and operating systems.
Each PDF file holds a description of a fixed-layout flat document, including the text, fonts, graphics, and other information needed to display it.
Several libraries are available for creating and manipulating PDF documents programmatically, such as −
Adobe PDF Library − This library provides APIs in languages such as C++, .NET, and Java. Using it, we can edit, view, print, and extract text from PDF documents.
Formatting Objects Processor (Apache FOP) − An open-source print formatter driven by XSL Formatting Objects (XSL-FO) that is independent of the output format. Its primary output target is PDF.
iText − This library provides APIs in Java, C#, and other .NET languages. Using it, we can create and manipulate PDF, RTF, and HTML documents.
JasperReports − This is a Java reporting tool that generates reports in PDF and other formats, including Microsoft Excel, RTF, ODT, comma-separated values (CSV), and XML.
What is PDFBox
Apache PDFBox is an open-source Java library for working with PDF documents. Using this library, you can develop Java programs that create, convert, and manipulate PDF documents.
In addition, PDFBox includes a command-line utility, distributed as a JAR file, for performing various operations on PDF documents.
Features of PDFBox
The following are the notable features of PDFBox −
Extract Text − Using PDFBox, you can extract Unicode text from PDF files.
Split & Merge − Using PDFBox, you can split a single PDF file into multiple files and merge multiple files into a single file.
Fill Forms − Using PDFBox, you can fill in form data in a document.
Print − Using PDFBox, you can print a PDF file using the standard Java printing API.
Save as Image − Using PDFBox, you can save PDFs as image files, such as PNG or JPEG.
Create PDFs − Using PDFBox, you can create a new PDF file from a Java program, and you can also include images and fonts.
Signing − Using PDFBox, you can add digital signatures to PDF files.
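As a rough illustration of two of these features (Create PDFs and Extract Text), here is a minimal sketch. It assumes the PDFBox 2.0.x API is on the classpath; PDFBox 3.x loads documents through a separate Loader class and constructs the standard fonts differently, so these calls would need adjusting there. The class name PdfBoxSketch, the method names, and the file name hello.pdf are hypothetical and chosen only for this example.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical helper class; assumes PDFBox 2.0.x.
class PdfBoxSketch {

    // Creates a one-page PDF at the given path containing a single line of text.
    static void createPdf(String path, String text) throws IOException {
        try (PDDocument document = new PDDocument()) {
            PDPage page = new PDPage();
            document.addPage(page);

            // The content stream must be closed before the document is saved.
            try (PDPageContentStream content = new PDPageContentStream(document, page)) {
                content.beginText();
                content.setFont(PDType1Font.HELVETICA, 12);
                content.newLineAtOffset(50, 700); // position measured from the bottom-left corner
                content.showText(text);
                content.endText();
            }

            document.save(path);
        }
    }

    // Loads an existing PDF and returns its text content.
    static String extractText(String path) throws IOException {
        try (PDDocument document = PDDocument.load(new File(path))) {
            return new PDFTextStripper().getText(document);
        }
    }

    public static void main(String[] args) throws IOException {
        String path = "hello.pdf"; // hypothetical output file
        createPdf(path, "Hello PDFBox");
        System.out.println(extractText(path));
    }
}
```

Both methods use try-with-resources so that the document is closed even if an exception is thrown, which releases the resources PDFBox holds for it.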
Applications of PDFBox
The following applications use PDFBox −
Apache Nutch − Apache Nutch is open-source web search software. It builds on Apache Lucene, adding web-specific components such as a crawler, a link-graph database, and parsers for HTML and other document formats.
Apache Tika − Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
Components of PDFBox
The following are the four main components of PDFBox −
PDFBox − This is the main part of PDFBox. It contains the classes and interfaces related to content extraction and manipulation.
FontBox − This contains the classes and interfaces related to fonts. Using these classes, we can modify the font of the text in a PDF document.
XmpBox − This contains the classes and interfaces that handle XMP metadata.
Preflight − This component is used to verify PDF files against the PDF/A-1b standard.