Check if a String is Present in a PDF File in Python

In today's digital world, PDF files have become an essential medium for storing and sharing information. Python provides several libraries that allow us to interact with PDF files and extract information from them. One common task is to search for a particular string within a PDF file.

The first two examples below search plain text files and are included only to illustrate basic string searching. They do not work on real PDFs: a PDF is a binary format, and its page text is usually stored in compressed streams alongside formatting and metadata. For real PDF processing, use a specialized library such as PyPDF2 or pdfplumber.

Basic Text Search Approach (Limited)

This example reads a plain text file, sample.txt, in one go and checks whether the search string occurs anywhere in it. It will not work on a real PDF, but it demonstrates the basic idea:

# Basic approach - Limited functionality
search_string = 'Python'

try:
    with open("sample.txt", "r", encoding='utf-8') as f:
        content = f.read()
        
        if search_string in content:
            print(f"String '{search_string}' found in the file")
        else:
            print(f"String '{search_string}' not found in the file")
            
except FileNotFoundError:
    print("File not found")
except Exception as e:
    print(f"Error reading file: {e}")

Output, assuming sample.txt contains the word "Python":

String 'Python' found in the file

Line-by-Line Search Approach

This method reads the file line by line and reports the number of the first line that contains the string:

search_string = 'Python'
found = False

try:
    with open("sample.txt", "r", encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            if search_string in line:
                print(f"String '{search_string}' found at line {line_num}")
                found = True
                break
        
        if not found:
            print(f"String '{search_string}' not found in the file")
            
except FileNotFoundError:
    print("File not found")
except Exception as e:
    print(f"Error reading file: {e}")

Output, assuming the first line of sample.txt contains the word "Python":

String 'Python' found at line 1
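
The loop above stops at the first match. When every occurrence matters, a small variation can collect all matching line numbers instead. The following is a minimal sketch; the helper name find_matching_lines is hypothetical, and it accepts any iterable of lines (an open text file works, since iterating a file yields its lines).

```python
def find_matching_lines(lines, search_string):
    """Return the 1-based numbers of all lines that contain search_string."""
    matches = []
    for line_num, line in enumerate(lines, 1):
        if search_string in line:
            matches.append(line_num)
    return matches


# Plain strings stand in for the lines of a text file here
sample_lines = ["Python is fun\n", "nothing to see\n", "I like Python\n"]
print(find_matching_lines(sample_lines, "Python"))
```

To use it on a file, pass the open file object, for example find_matching_lines(f, "Python") inside a with open("sample.txt", "r", encoding="utf-8") as f: block.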

Proper PDF Processing with PyPDF2

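For actual PDF files, use a specialized library. The function below uses PyPDF2 to extract the text of each page and returns the first page on which the string appears, ignoring case. Note that PyPDF2 is no longer actively developed; its successor, pypdf, provides the same PdfReader interface: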

# First install: pip install PyPDF2
import PyPDF2

def search_pdf(file_path, search_string):
    try:
        with open(file_path, 'rb') as file:
            pdf_reader = PyPDF2.PdfReader(file)
            
            for page_num, page in enumerate(pdf_reader.pages):
                # Fall back to an empty string if a page yields no text (e.g. scanned images)
                text = page.extract_text() or ""
                if search_string.lower() in text.lower():
                    return f"String '{search_string}' found on page {page_num + 1}"
            
            return f"String '{search_string}' not found in PDF"
    
    except Exception as e:
        return f"Error processing PDF: {e}"

# Usage
result = search_pdf("document.pdf", "Python")
print(result)
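
The function above returns as soon as it finds one matching page. Extracted text also tends to contain line breaks in unexpected places, which can split a multi-word phrase. The sketch below addresses both points: it collapses whitespace before comparing and returns every matching page number. The helper name pages_containing is hypothetical, and it works on any iterable of page strings rather than calling a PDF library itself.

```python
def pages_containing(page_texts, search_string):
    """Return 1-based page numbers whose text contains search_string.

    The comparison ignores case and treats any run of whitespace
    (spaces, tabs, newlines) as a single space.
    """
    needle = " ".join(search_string.split()).casefold()
    hits = []
    for page_num, text in enumerate(page_texts, 1):
        # Treat a missing or empty page text as an empty string
        normalized = " ".join((text or "").split()).casefold()
        if needle in normalized:
            hits.append(page_num)
    return hits


# Plain strings stand in for the text extracted from three pages
sample_pages = ["Intro to\nPython basics", None, "Advanced PYTHON   basics"]
print(pages_containing(sample_pages, "python basics"))  # prints [1, 3]
```

With PyPDF2, the page texts could be supplied as (page.extract_text() for page in pdf_reader.pages), evaluated while the PDF file is still open.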

Comparison of Approaches

Method           Speed    Works with PDF?   Best For
Full file read   Fast     No                Text files only
Line-by-line     Medium   No                Large text files
PyPDF2           Slower   Yes               Actual PDF files

Key Limitations

The text-based approaches shown have several important limitations:

  • PDFs contain binary data that cannot be read as plain text
  • Formatting, images, and metadata are not accessible
  • Text encoding issues may cause errors
  • Complex PDF structures require specialized parsing
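
To illustrate the first point, the sketch below imitates how page content is commonly stored in a PDF: compressed with Deflate (the FlateDecode filter). The byte string used here is a simplified stand-in for a page content stream, not a complete PDF, and the variable names are illustrative only.

```python
import zlib

search_bytes = b"Python"

# Simplified stand-in for the drawing instructions of a PDF page
page_content = b"BT /F1 12 Tf (Searching for Python in a PDF) Tj ET\n" * 20

# Compress the content the way a FlateDecode stream would store it
stored_bytes = zlib.compress(page_content)

# Searching the stored bytes directly versus searching the decoded content
found_in_raw = search_bytes in stored_bytes
found_after_decoding = search_bytes in zlib.decompress(stored_bytes)

print(f"Found in stored bytes: {found_in_raw}")
print(f"Found after decompressing: {found_after_decoding}")
```

The search string is present once the stream is decoded but generally not in the compressed bytes, which is why a plain in check on the raw file misses text that a PDF library can find.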

Conclusion

While basic string searching demonstrates fundamental concepts, real PDF processing requires specialized libraries like PyPDF2 or pdfplumber. Use the text-based approach only for actual text files, not PDF documents.

Updated on: 2026-03-27T10:58:45+05:30
