Check if a String is Present in a PDF File in Python


In today's digital world, PDF files have become an essential medium for storing and sharing information. However, sometimes it can be difficult to find a specific string of text within a PDF document, especially if the file is lengthy or complex. This is where Python, a popular programming language, comes in handy.

Python provides several libraries that allow us to interact with PDF files and extract information from them. One common task is to search for a particular string within a PDF file. This can be useful for various purposes, such as data analysis, text mining, or information retrieval.

In this context, we have a problem where we want to check whether a particular string exists in a PDF file or not. To solve this problem, we can use two different approaches.

The first approach searches the whole file at once. We read the entire contents of the file into a single string and check whether the target string occurs in it. This approach is simple: it needs only one read and one membership test, with no loop over individual lines.

The second approach iterates through the file line by line and checks each line for the string. It can be slower than the first approach, but it is useful when we need more fine-grained control over the search process, such as when we want to know where in the file the string occurs.

In summary, the first approach involves directly searching for the string within the PDF file, while the second approach involves iterating through each line of the PDF file and checking if the string exists in each line. The choice of which approach to use depends on the specific requirements of the task at hand.

Now that we have covered both approaches, let's write the code for the first one.

Approach 1

# The string we want to search for
St = 'Shruti'

# Open the PDF file in read mode. A PDF is largely binary data, so we
# decode it as latin-1, which maps every byte to a character and
# therefore never raises a UnicodeDecodeError.
with open("example.pdf", "r", encoding="latin-1") as f:
    # Read the entire file into a string variable 'a'
    a = f.read()

    # Check if the string 'St' is present in the file contents
    if St in a:
        # If the string is present, print a message indicating its presence
        print(f"String '{St}' Is Found In The PDF File")
    else:
        # If the string is not present, print a message indicating its absence
        print(f"String '{St}' Not Found")

# The with statement closes the file automatically, so no f.close() is needed

Explanation

In this code, we have a string St that we want to search for in a PDF file. We open the PDF file in read mode using the open() function, and the file object is assigned to the variable f. The file name 'example.pdf' should be replaced with the name of the file you want to search.

Next, we use the read() method to read the entire contents of the PDF file into a string variable a. Note that this string holds the raw contents of the file as stored on disk, not text extracted by a PDF parser.

We then check if the string St is present in the file contents using the in keyword. If the string is found in the PDF file, we print a message indicating its presence. If the string is not found, we print a message indicating its absence.

Finally, the with statement closes the file automatically when its block ends, releasing any system resources associated with the file handle, so a separate call to close() is not required.

Overall, this code provides a simple way to search for a string in a PDF file. However, read() returns the raw contents of the file, not the text as it appears on the page. PDF files commonly store page text in compressed or encoded form, so this method can miss a string that is clearly visible in a PDF viewer. In such cases, it may be necessary to use a specialized PDF library to extract the text from the PDF file and search for the string within the extracted text.
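To illustrate why a raw search can miss text, here is a simplified sketch that inflates zlib-compressed (Flate) streams before searching them. It uses only the standard library. The names STREAM_RE, inflate_streams and found_in_streams are hypothetical, and the sketch assumes unencrypted streams compressed with Flate only. Real PDFs may use other filters, hex-encoded strings, or text split into fragments, so this is a teaching aid rather than a replacement for a PDF library.

```python
import re
import zlib

# Matches the raw bytes between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes: bytes):
    """Yield the decompressed content of every zlib (Flate) stream found."""
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes that follow
            # the compressed data and precede "endstream".
            yield zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (assumption: skip such streams)


def found_in_streams(pdf_bytes: bytes, needle: bytes) -> bool:
    """Return True if needle occurs inside any inflated stream."""
    return any(needle in data for data in inflate_streams(pdf_bytes))


if __name__ == "__main__":
    # Build a tiny PDF-like byte string with one compressed stream.
    content = b"BT /F1 12 Tf (Hello Shruti) Tj ET"
    fake_pdf = (
        b"%PDF-1.4\n1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
        + zlib.compress(content)
        + b"\nendstream\nendobj\n"
    )
    print("raw search:", b"Shruti" in fake_pdf)
    print("stream search:", found_in_streams(fake_pdf, b"Shruti"))
```

The raw search looks at the compressed bytes, while the stream search looks at the inflated content, which is where the page text lives in this example.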

To run the Approach 1 script, saved as main.py, we need to run the command shown below.

Command

python3 main.py

Once we run the above command, we will get the following output in the terminal.

Output

String 'Shruti' Is Found In The PDF File
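As a variation on Approach 1, the file can be opened in binary mode so that no text decoding takes place at all. The sketch below assumes the search string can be encoded as UTF-8. The function name raw_file_contains and its encoding parameter are hypothetical.

```python
def raw_file_contains(path: str, needle: str, encoding: str = "utf-8") -> bool:
    """Return True if the encoded needle occurs in the file's raw bytes."""
    # Binary mode returns bytes, so undecodable data cannot raise an error.
    with open(path, "rb") as f:
        data = f.read()
    # Compare bytes with bytes: encode the search string first.
    return needle.encode(encoding) in data
```

Calling raw_file_contains("example.pdf", "Shruti") returns True or False instead of printing, which makes the check easier to reuse in other code. The same limitation applies as before: only the raw file contents are searched.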

Now let's focus on the second approach.

Approach 2

To check if a string exists in a PDF file, we can also search line by line. First, we open the file and assign the resulting file object to a variable called f. We then set the flag c and the line counter line to zero before iterating through the file.

Using a for loop, we iterate through each line of the file and check if the string is present. If the string is found in a line, we set the flag and stop the loop, and afterwards we print a message reporting the result. Finally, we close the file to release any system resources associated with the file handle.

Searching line by line tells us where in the file the string occurs, not just whether it occurs. However, this approach may be slower than searching the entire file at once, especially for larger PDF files. Additionally, the file may contain formatting or other non-text elements, which may require a specialized PDF library to handle.

Consider the code shown below.

Example

# Define the string to search for
St = 'Shruti'

# Open the PDF file in read mode, decoding it as latin-1 so that the
# binary data in the file cannot cause a UnicodeDecodeError
f = open("example.pdf", "r", encoding="latin-1")

# Initialize the flag and the line counter
c = 0
line = 0

# Loop over each line in the file
for a in f:
    # Increment the line counter
    line = line + 1

    # Check if the string is present in the line
    if St in a:
        # Set the flag variable to indicate the string was found
        c = 1
        # Exit the loop once the string is found
        break

# Check the flag variable to see if the string was found
if c == 0:
    # Print a message indicating the string was not found
    print(f"String '{St}' Not Found")
else:
    # Print a message indicating the line number where the string was found
    print(f"String '{St}' Is Found In Line {line}")

# Close the file to release any system resources associated with the file handle
f.close()

Explanation

This code searches for the string 'Shruti' in a PDF file named example.pdf. The file should be in the same directory as the Python script, or the full path to the file should be specified.

We start by defining the string we want to search for, and opening the PDF file in read mode using the open() function. The file object is assigned to the variable f.

We then initialize two variables: c is a flag variable that is set to 0, and line is a counter variable that is set to 0.

Next, we use a for loop to iterate over each line in the file. For each line, we increment the line counter. We then check if the string St is present in the line using the in operator. If it is, we set the c flag variable to 1 to indicate that the string was found, and break out of the loop using the break statement.

After the loop, we check the value of the c flag variable. If it is still 0, the string St was not found in the file, and we print a message indicating this. Otherwise, we print a message indicating the line number where the string was found using the print() function.

Finally, we close the file using the close() method to release any system resources associated with the file handle.

This approach can be useful for searching for a string in a large PDF file, as it allows us to stop searching once the string is found, rather than reading the entire file into memory. Keep in mind that the reported line number refers to a line of the raw file on disk, not to a line of text on a page. This method also searches only the raw file contents, so it can miss text that the PDF stores in compressed or encoded form. In such cases, it may be necessary to use a specialized PDF library to extract the text from the PDF file and search for the string within the extracted text.
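As one possible way to follow that advice, the sketch below searches text extracted page by page with the third-party pypdf package. It assumes pypdf is installed (pip install pypdf). The function names first_page_with and search_pdf are hypothetical. Extraction quality varies between documents, and scanned pages that contain only images yield no text at all.

```python
from typing import Iterable, Optional


def first_page_with(page_texts: Iterable[str], needle: str) -> Optional[int]:
    """Return the 1-based number of the first page whose text contains needle."""
    for number, text in enumerate(page_texts, start=1):
        if needle in text:
            return number
    return None


def search_pdf(path: str, needle: str) -> Optional[int]:
    """Extract text page by page with pypdf and search it."""
    # Imported here so the helper above works without pypdf installed.
    from pypdf import PdfReader  # third-party package (assumption)

    reader = PdfReader(path)
    # Fall back to an empty string if a page yields no text.
    texts = ((page.extract_text() or "") for page in reader.pages)
    return first_page_with(texts, needle)
```

With this sketch, search_pdf("example.pdf", "Shruti") would return a page number rather than a raw-file line number, or None when the string is not found in the extracted text.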

To run the Approach 2 script, saved as main.py, we need to run the command shown below.

Command

python3 main.py

Once we run the above command, we will get the following output in the terminal.

Output

String 'Shruti' Is Found In Line 3727
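If every occurrence is needed rather than only the first, Approach 2 can be adapted to collect all matching line numbers. The sketch below is a variation, not the code shown above. The function name matching_lines is hypothetical, and the latin-1 encoding is an assumption made so that any byte sequence decodes without errors.

```python
from typing import List


def matching_lines(path: str, needle: str) -> List[int]:
    """Return the 1-based numbers of all raw-file lines containing needle."""
    # latin-1 is assumed so that binary data never raises a decode error.
    with open(path, "r", encoding="latin-1") as f:
        # enumerate replaces the manual counter; there is no break, so
        # the whole file is scanned.
        return [number for number, text in enumerate(f, start=1) if needle in text]
```

An empty list means the string was not found. As with the example above, the numbers refer to lines of the raw file, not to lines on a page.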

Conclusion

In conclusion, checking if a string is present in a PDF file in Python can be accomplished using various approaches, depending on the requirements of the task at hand.

In this tutorial, we discussed two approaches to checking if a string exists in a PDF file: directly searching the entire PDF file or searching line by line. We also provided working examples of both methods, along with detailed explanations and code comments. By understanding these methods, you should be able to search for specific text within PDF files using Python, which can be a valuable tool for a variety of applications, such as data mining, text extraction, and more.

Updated on: 02-Aug-2023
