pdftohtml Command in Linux



The pdftohtml command in Linux converts a PDF file to HTML or XML and writes the document's images as separate files (PNG by default). It is part of the poppler-utils package, which provides various utilities for handling PDF files. The tool can generate either a simple HTML file or a more detailed HTML representation that preserves the layout of the PDF.

This guide covers the syntax, options, and usage examples of the pdftohtml command.

Syntax of pdftohtml Command

The syntax of the pdftohtml command is as follows −

pdftohtml [options] <PDF-file> [<html-file> <xml-file>]

In the above syntax −

  • [options] − Optional flags that customize the output
  • <PDF-file> − The input PDF file to convert (required). Provide the full path, or just the filename if it is in the current directory.
  • [<html-file> <xml-file>] − The optional output filename. If it is omitted, pdftohtml derives the output filenames from the name of the PDF.
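As a small illustration of the optional output argument, the sketch below derives an output name from the PDF path and passes it explicitly, so the result should land in the current directory wherever the PDF lives. The helper name html_base_for and the sample paths are hypothetical −

```shell
#!/bin/sh
# Hypothetical helper: strip the directory and the .pdf suffix from a path.
# "reports/q1.pdf" becomes "q1".
html_base_for() {
    basename "$1" .pdf
}

pdf="document.pdf"
base="$(html_base_for "$pdf")"

# Run the conversion only when the tool and the input file are present.
if command -v pdftohtml >/dev/null 2>&1 && [ -f "$pdf" ]; then
    pdftohtml "$pdf" "$base"
fi
```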

pdftohtml Command Options

The options of the pdftohtml command are listed below −

Option                   Description
-f <int>                 Specifies the first page to convert
-l <int>                 Specifies the last page to convert
-q                       Suppresses output messages and errors
-p                       Replaces .pdf links with .html links
-c                       Generates a complex document
-s                       Generates a single document that includes all pages
-dataurls                Embeds images as data URLs in the HTML
-i                       Ignores images during the conversion
-noframes                Generates HTML without frames
-stdout                  Outputs the result to the standard output
-zoom <fp>               Zooms the PDF document (default is 1.5)
-xml                     Generates XML output for post-processing
-noroundcoord            Disables coordinate rounding (XML output only)
-hidden                  Includes hidden text in the output
-nomerge                 Prevents merging of paragraphs
-enc <string>            Specifies the text encoding for the output
-fmt <string>            Defines the image file format (png or jpg)
-opw <string>            Specifies the owner password for encrypted files
-upw <string>            Specifies the user password for encrypted files
-nodrm                   Overrides DRM restrictions in the document
-wbt <fp>                Sets the word break threshold (default is 10%)
-fontfullname            Outputs the full name of fonts used
-v                       Displays copyright and version information
-h, -?, -help, --help    Displays usage information

Examples of pdftohtml Command in Linux

This section demonstrates the usage of the pdftohtml command in Linux with examples −

Converting a PDF to HTML

To convert a PDF file to HTML, use the following command −

pdftohtml document.pdf

The command generates HTML files and images in the current working directory. To set the output image format, use the -fmt option. By default, images are saved in PNG format; to save them as JPG, use the following command −

pdftohtml -fmt jpg document.pdf

To prevent the merging of paragraphs, use the -nomerge option −

pdftohtml -nomerge document.pdf

Converting a PDF to XML

To convert a PDF to XML, use the -xml option −

pdftohtml -xml document.pdf
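The XML output is meant for post-processing. As a rough sketch, the helper below counts the pages in such a file, assuming the output contains one <page ...> element per page and is written to document.xml. The name count_pages is hypothetical −

```shell
#!/bin/sh
# Hypothetical helper: count "<page " occurrences in an XML file.
# Assumes pdftohtml emits one <page ...> element per converted page.
count_pages() {
    grep -o '<page ' "$1" | wc -l | tr -d ' '
}

# Convert and inspect only when the tool and the input file are present.
# The output name document.xml is an assumption about the default naming.
if command -v pdftohtml >/dev/null 2>&1 && [ -f document.pdf ]; then
    pdftohtml -xml document.pdf
    [ -f document.xml ] && echo "pages: $(count_pages document.xml)"
fi
```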

Converting Specific Pages of a PDF to HTML

To convert specific pages of a PDF file to HTML, use the -f and -l options. For example, to convert only pages 2 through 3 of a PDF document, use the following command −

pdftohtml -f 2 -l 3 document.pdf
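In scripts, it can help to check the range before calling the tool. The following is a minimal sketch of such a wrapper; the names is_page_number and convert_range are hypothetical, and the checks reflect one reasonable policy rather than anything pdftohtml itself enforces −

```shell
#!/bin/sh
# Hypothetical helper: accept only positive integers without leading zeros.
is_page_number() {
    case "$1" in
        ''|*[!0-9]*|0*) return 1 ;;
        *) return 0 ;;
    esac
}

# Hypothetical wrapper: validate the range, then pass it to -f and -l.
convert_range() {
    first=$1; last=$2; pdf=$3
    if ! is_page_number "$first" || ! is_page_number "$last"; then
        echo "page numbers must be positive integers" >&2
        return 2
    fi
    if [ "$first" -gt "$last" ]; then
        echo "first page ($first) is after last page ($last)" >&2
        return 2
    fi
    pdftohtml -f "$first" -l "$last" "$pdf"
}

# Run the conversion only when the tool and the input file are present.
if command -v pdftohtml >/dev/null 2>&1 && [ -f document.pdf ]; then
    convert_range 2 3 document.pdf
fi
```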

Converting a PDF to a Single Page HTML

By default, the pdftohtml command may spread its HTML output over several files. To generate a single document that includes all pages, use the -s option −

pdftohtml -s document.pdf
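The same option works well in a loop. The sketch below converts every PDF in a directory with -s; the function name convert_all and the DRY_RUN variable are hypothetical conveniences that print the commands instead of running them −

```shell
#!/bin/sh
# Hypothetical helper: convert every PDF in a directory to single-document HTML.
# With DRY_RUN=1 the commands are printed instead of executed.
convert_all() {
    dir=$1
    for pdf in "$dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory has no PDFs.
        [ -e "$pdf" ] || continue
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "pdftohtml -s $pdf"
        else
            pdftohtml -s "$pdf"
        fi
    done
}

# Preview what would run for the current directory.
DRY_RUN=1
convert_all .
```

Setting DRY_RUN=0 (or leaving it unset) runs the conversions for real.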

Converting a PDF to HTML with a Different Zoom-Level

By default, the pdftohtml command uses a zoom factor of 1.5. To change it, use the -zoom option −

pdftohtml -zoom 2.0 document.pdf

The above command converts a PDF file into an HTML file with a zoom level of 200%. Similarly, to scale the content to 50%, use 0.5.
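Since the option takes a factor rather than a percentage, a script that works in percentages needs a small conversion. A minimal sketch, in which the helper name zoom_from_percent is hypothetical −

```shell
#!/bin/sh
# Hypothetical helper: turn a percentage into a -zoom factor (percent / 100).
zoom_from_percent() {
    LC_ALL=C awk -v p="$1" 'BEGIN { printf "%.2f\n", p / 100 }'
}

zoom="$(zoom_from_percent 200)"   # 2.00

# Run the conversion only when the tool and the input file are present.
if command -v pdftohtml >/dev/null 2>&1 && [ -f document.pdf ]; then
    pdftohtml -zoom "$zoom" document.pdf
fi
```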

Converting a PDF to HTML with Encoded Images

By default, the pdftohtml command saves images as separate files. To embed them directly in the HTML as base64 data URLs instead, use the -dataurls option −

pdftohtml -dataurls document.pdf
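One way to confirm that images were embedded is to look for data URLs in the generated HTML. The sketch below assumes that embedded images appear as data:image/ URLs and that combining -dataurls with -noframes writes document.html; the helper name count_data_urls is hypothetical −

```shell
#!/bin/sh
# Hypothetical helper: count "data:image/" occurrences in the given files.
count_data_urls() {
    cat "$@" | grep -o 'data:image/' | wc -l | tr -d ' '
}

# Convert and inspect only when the tool and the input file are present.
# The output name document.html is an assumption about the default naming.
if command -v pdftohtml >/dev/null 2>&1 && [ -f document.pdf ]; then
    pdftohtml -dataurls -noframes document.pdf
    [ -f document.html ] && echo "embedded images: $(count_data_urls document.html)"
fi
```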

Converting a PDF to HTML without Images

To ignore images during conversion, use the -i option −

pdftohtml -i document.pdf

Converting an Encrypted PDF to HTML

A PDF file may be protected with a user password or an owner password. To supply the user password, use the -upw option −

pdftohtml -upw user-password document.pdf

To supply the owner password, use the -opw option −

pdftohtml -opw owner-password document.pdf
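To keep the password out of a script file, it can be read from the environment. The sketch below is one possible approach; the helper password_flag and the variables PDF_PASSWORD and PDF_PASSWORD_KIND are hypothetical. Note that a password passed as a command-line argument may still be visible to other users in the process list −

```shell
#!/bin/sh
# Hypothetical helper: map a password kind to the matching pdftohtml option.
password_flag() {
    case "$1" in
        user)  echo "-upw" ;;
        owner) echo "-opw" ;;
        *)     echo "unknown password kind: $1" >&2; return 2 ;;
    esac
}

# PDF_PASSWORD and PDF_PASSWORD_KIND are assumed environment variables.
kind="${PDF_PASSWORD_KIND:-user}"

# Convert only when the kind is valid, a password is set,
# and both the tool and the input file are present.
if flag="$(password_flag "$kind")" &&
   [ -n "${PDF_PASSWORD:-}" ] &&
   command -v pdftohtml >/dev/null 2>&1 &&
   [ -f document.pdf ]; then
    pdftohtml "$flag" "$PDF_PASSWORD" document.pdf
fi
```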

Including Hidden Text during Conversion

To include the hidden text of the PDF in the output, use the -hidden option −

pdftohtml -hidden document.pdf

Hidden text is text that exists within the PDF but is not visible to the user under normal viewing conditions. It may include, for example, an invisible OCR text layer over a scanned page or content that is hidden in the default PDF view.

Displaying Help

To display help about the pdftohtml command, use any of the -h, -help, --help, or -? options −

pdftohtml -?

Conclusion

The pdftohtml command in Linux is a versatile tool for converting PDF files into HTML or XML. It provides various options to customize the output, such as defining page ranges, embedding images as data URLs, converting to a single-page HTML, or excluding images. Additionally, it can handle encrypted files and hidden text while offering features like zoom adjustment and encoding control.

In this tutorial, we explained the pdftohtml command, its syntax, options, and usage in Linux with examples.
