How to search the contents of multiple PDF files on Linux?

The pdfgrep command in Linux is used to search for a particular pattern of characters in one or more PDF files. It works much like grep: it displays the lines of text that contain the pattern that we are trying to search.

The pattern that we are trying to search in the files is interpreted as a regular expression.

Installing pdfgrep

For Ubuntu/Debian

sudo apt-get update -y
sudo apt-get install -y pdfgrep

For CentOS

sudo yum install -y pdfgrep
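
After installation, it can help to confirm that the binary is actually on the PATH before running any searches. The following is a minimal sketch for a POSIX shell; have_cmd and check_pdfgrep are hypothetical helper names introduced here for illustration, not part of pdfgrep.

```shell
# have_cmd NAME: succeed if NAME resolves to a command (hypothetical helper).
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# check_pdfgrep: print the installed version, or a hint if pdfgrep is missing.
check_pdfgrep() {
    if have_cmd pdfgrep; then
        pdfgrep --version
    else
        echo "pdfgrep is not installed" >&2
        return 1
    fi
}

# Usage:
#   check_pdfgrep
```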

Syntax

pdfgrep [options...] pattern [files]

While there are plenty of different options available to us, some of the most used are −

-c : counts the number of matches per input file
-h : suppresses the file name prefix on output
-i : ignores case when matching
-H : prints the file name for each match
-n : prefixes each match with the number of the page where it is found
-r : recursively searches the PDF files under each given directory
-R : same as -r, but it also follows all symlinks
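
The options above can be combined. As a rough sketch, the two wrappers below show -c used for per-file counts and -H with -n used for file name and page number prefixes; the function names count_matches and matches_with_pages are hypothetical and exist only for this example.

```shell
# count_matches PATTERN FILE...: case-insensitive match count per file.
count_matches() {
    pdfgrep -c -i "$@"
}

# matches_with_pages PATTERN FILE...: print each match prefixed with
# its file name (-H) and page number (-n), ignoring case (-i).
matches_with_pages() {
    pdfgrep -H -n -i "$@"
}

# Usage (report.pdf and notes.pdf are placeholder file names):
#   count_matches "invoice" report.pdf notes.pdf
#   matches_with_pages "invoice" report.pdf notes.pdf
```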

Now, let’s consider a case where we want to find a particular pattern in all the pdf files in a particular directory, say dir1.

Syntax

pdfgrep -HiR "word" *

In the above command, replace the “word” placeholder with the pattern you want to search for.

For example, to search for “func main()”, we make use of the command shown below −

pdfgrep -HiR "func main()" *

The above command searches for the string “func main()” in all the PDF files in the current directory and in its subdirectories.

Output

main.go:120:func main() {}
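
Since pdfgrep treats its pattern as a regular expression by default, characters such as the parentheses in “func main()” are regex syntax rather than literal text. When a literal match is wanted, the -F (--fixed-strings) option can be added. A minimal sketch, where search_literal is a hypothetical wrapper name and the directory argument defaults to the current directory:

```shell
# search_literal STRING [DIR]: recursive, case-insensitive search for a
# fixed string (no regex interpretation) in the PDF files under DIR.
search_literal() {
    pdfgrep -F -H -i -R "$1" "${2:-.}"
}

# Usage:
#   search_literal "func main()" dir1
```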

In case we only want to search the PDF files in a single directory and not its subdirectories, we drop the -R flag and pass the PDF files explicitly, as shown below −

pdfgrep -i "func main()" *.pdf

In the above command the *.pdf glob hands only PDF files to pdfgrep, so it is not asked to open the subdirectories or other file types that a bare * would also match.

Output

main.go:120:func main() {}

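Another approach is to combine the find command with pdftotext and grep: find locates each PDF file, pdftotext converts it to plain text, and grep searches that text.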

Command

find /path -name '*.pdf' -exec sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "func main()"' sh {} \;

Output

./main.go:func main() {
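
The same pipeline can be wrapped in a small reusable function. This is only a sketch under a few assumptions: pdftotext (from poppler-utils) and GNU grep are installed, file names contain no newlines, and search_with_pdftotext is a hypothetical name chosen for this example. It uses grep -F so that the pattern is matched as a literal string.

```shell
# search_with_pdftotext DIR STRING: convert every PDF under DIR to text
# and print the lines containing STRING, prefixed with the PDF's path.
search_with_pdftotext() {
    find "$1" -type f -name '*.pdf' | while IFS= read -r pdf; do
        # "-" makes pdftotext write to stdout; --label replaces the
        # "(standard input)" prefix with the real file name.
        pdftotext "$pdf" - 2>/dev/null |
            grep -F --with-filename --label="$pdf" -- "$2"
    done
}

# Usage:
#   search_with_pdftotext /path "func main()"
```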
raja
Published on 30-Jul-2021 09:12:22