pdftotext Command in Linux



The pdftotext command in Linux reads a PDF file and converts its content into a text file. The pdftotext reads a PDF file and converts its content into a text file. If no text file is specified, it creates one with the same name as the PDF but with a .txt extension. Using a dash (-) as the text file name displays the output directly in the terminal instead of saving it.

Table of Contents

Here is a comprehensive guide to the options available with the pdftotext command −

Syntax of pdftotext Command

The syntax of the pdftotext command is as follows −

pdftotext [options] [PDF-file [text-file]]

In the above syntax, the [options] field is used to specify optional flags that modify how the command works. The [PDF-file] is the input file that needs to be converted. The [text-file] is the output text file where the extracted content will be saved. If omitted, the output file will have the same name as the input PDF but with a .txt extension.

pdftotext Command Options

The options of the Linux pdftotext command are listed below −

Option Description
-f <int> First page to convert
-l <int> Last page to convert
-r <fp> Resolution, in DPI (default is 72)
-x <int> X-coordinate of the crop area top-left corner
-y <int> Y-coordinate of the crop area top-left corner
-W <int> Width of crop area in pixels (default is 0)
-H <int> Height of crop area in pixels (default is 0)
-layout Maintain original physical layout
-fixed <fp> Assume fixed-pitch (or tabular) text
-raw Keep strings in content stream order
-nodiag Discard diagonal text
-htmlmeta Generate a simple HTML file, including the meta information
-tsv Generate a simple TSV file, including the meta information for bounding boxes
-enc <string> Output text encoding name
-listenc List available encodings
-eol <string> Output end-of-line convention (unix, dos, or mac)
-nopgbrk Don't insert page breaks between pages
-bbox Output bounding box for each word and page size to HTML. Sets -htmlmeta
-bbox-layout Like -bbox but with extra layout bounding box data. Sets -htmlmeta
-cropbox Use the crop box rather than media box
-colspacing <fp> Spacing allowed after a word before considering adjacent text to be a new column, as a fraction of the font size (default is 0.7)
-opw <string> Owner password (for encrypted files)
-upw <string> User password (for encrypted files)
-q Don't print any messages or errors
-v Print copyright and version info
-h, -help, --help, -? Print usage information

Examples of pdftotext Command in Linux

This section explains the usage of the pdftotext command with examples −

Converting a PDF to a Text File

To convert a PDF to a text file, use the following command −

pdftotext sample.pdf
pdftotext Command in Linux1

The above command converts the sample.pdf file to a text file and saves it in the current working directory with the name of sample.txt.

Converting Specific Pages of a PDF File

To convert the specific pages of a PDF file, use the -f or -l options. For example, to convert page 2 to page 4 of a PDF file to a text file, use the following command −

pdftotext -f 2 -l 4 sample.pdf

Preserving the Text Layout of a PDF File

The -layout option in pdftotext is used to preserve the physical appearance of the text as closely as possible to how it is laid out in the original PDF.

pdftotext -layout sample.pdf

By default, the pdftotext tries to output the text in reading order. This means text is extracted in a logical flow, ignoring the physical structure, such as columns or specific text alignment.

Displaying Text Directly in the Terminal

To display the text directly in the terminal, use the dash (-) with the pdftotext command in the following way −

pdftotext sample.pdf -
pdftotext Command in Linux2

Cropping a Specific Area of the PDF

To extract text from a specific area of the PDF file, use the -x, -y, -W, and -H options.

pdftotext -x 100 -y 50 -W 300 -H 200 sample.pdf
pdftotext Command in Linux3

Using Specific Encoding for Text Output

Use the -enc option to use specific encoding for the text output. To list all the supported encodings, use the following command −

pdftotext -listenc
pdftotext Command in Linux4

To set encoding, use the command given below −

pdftotext -enc UTF-16 sample.pdf

Excluding Diagonal Text from the Output

To exclude the diagonal text from the output, use the -nodiag option −

pdftotext -nodiag sample.pdf

The diagonal text is specifically used for watermarks.

Skipping Page Breaks

To skip the page breaks between the pages, use the -nopgbrk option −

pdftotext -nopgbrk sample.pdf

Setting DPI

The default Dots Per Inch (DPI) is 72. To set a different DPI, use the -r option. For example, to set DPI as 150, use the pdftotext command in the following way −

pdftotext -r 150 sample.pdf

Setting End of Line (EOL)

The -eol option in pdftotext specifies the end-of-line (EOL) convention used in the output text file. Depending on the target operating system or user preference it determines how line breaks are represented in the extracted text.

pdftotext -eol unix sample.pdf

The above command produces a file with \n as line breaks. Other types are dos and mac.

Note that the default EOL style depends on the system where pdftotext is running. Specifying -eol ensures the desired format is used, regardless of the system default.

Generating a Tab Separated Values (TSV) File

To generate a TSV file, use the -tsv option −

pdftotext -tsv sample.pdf
pdftotext Command in Linux5

This format is particularly useful for structured data, such as tables, where each table cell is separated by a tab character (\t), making it easy to import into spreadsheet software like Excel or to process with data tools.

Displaying Help

To display help about the pdftotext command, use any option from -h, -help, --help, or -?

pdftotext -?

Conclusion

The pdftotext command in Linux provides a handy way to convert PDF files into text files. Using various flags, specific pages can be extracted, the original layout preserved, areas cropped, and different text encodings applied, among other features. It is a versatile tool for converting PDF content into a text format that can be easily manipulated or analyzed.

Advertisements