Working with PDF Files in Python

Python has several libraries for working with PDF files. PyPDF2 is a popular pure-Python library that can split, merge, crop, and transform PDF pages. It can also extract text and metadata, and add password protection to PDF files.

Installation

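Install PyPDF2 using pip. A plain install pulls a 3.x release, which removed the older camelCase API (PdfFileReader, getPage(), and so on) in favor of PdfReader and snake_case names: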

pip install PyPDF2

Verify the installation:

import PyPDF2
print("PyPDF2 imported successfully!")
PyPDF2 imported successfully!
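
Because the API differs between major versions, it can help to check which version is actually installed. The following is a minimal sketch using only the standard library; the helper name installed_version is hypothetical and not part of PyPDF2.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("PyPDF2")
if version is None:
    print("PyPDF2 is not installed")
else:
    print("PyPDF2 version:", version)
```

A version string starting with 3 means the PdfReader and PdfWriter names apply.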

Extracting PDF Metadata

You can extract useful information like author, title, subject, and page count from any PDF file:

def extract_pdf_metadata():
    # Print hard-coded sample values to show the output format (no PDF is read here)
    print("PDF Metadata Example:")
    print("Author: Sample Author")
    print("Creator: Sample Creator")
    print("Producer: Sample Producer")
    print("Subject: Sample Subject")
    print("Title: Sample PDF Document")
    print("Number of Pages: 10")

extract_pdf_metadata()
PDF Metadata Example:
Author: Sample Author
Creator: Sample Creator
Producer: Sample Producer
Subject: Sample Subject
Title: Sample PDF Document
Number of Pages: 10

Complete Metadata Extraction Function

from PyPDF2 import PdfReader  # PyPDF2 3.x; replaces the removed PdfFileReader

def extract_pdf_metadata(file_path):
    with open(file_path, 'rb') as file:
        pdf = PdfReader(file)
        info = pdf.metadata               # replaces getDocumentInfo(); None if the PDF has no metadata
        number_of_pages = len(pdf.pages)  # replaces getNumPages()

    if info is None:
        print("No document metadata found")
    else:
        print("Author:", info.author)
        print("Creator:", info.creator)
        print("Producer:", info.producer)
        print("Subject:", info.subject)
        print("Title:", info.title)
    print("Number of Pages:", number_of_pages)

# Usage
# extract_pdf_metadata('document.pdf')
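
A PDF without an information dictionary yields None instead of a metadata object, and individual fields can be missing too, so code that consumes the metadata often wants a plain dict with predictable keys. The sketch below shows one way to do that. The names FIELDS and metadata_to_dict are hypothetical, and a SimpleNamespace stands in for the real metadata object so the example runs without a PDF on disk.

```python
from types import SimpleNamespace

# Attribute names printed by the function above
FIELDS = ("author", "creator", "producer", "subject", "title")


def metadata_to_dict(info):
    """Copy the common metadata fields into a plain dict.

    `info` is whatever the reader returned for the document metadata;
    it may be None when the PDF has no information dictionary.
    """
    if info is None:
        return {field: None for field in FIELDS}
    # getattr with a default tolerates objects that lack a field
    return {field: getattr(info, field, None) for field in FIELDS}


# Stand-in object (an assumption for illustration, not a PyPDF2 type)
sample = SimpleNamespace(author="Sample Author", title="Sample PDF Document")
print(metadata_to_dict(sample))
print(metadata_to_dict(None))
```

With a real file, the value passed in would be the reader's metadata, for example metadata_to_dict(PdfReader('document.pdf').metadata) under PyPDF2 3.x.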

Extracting Text from PDFs

PyPDF2 can extract text from PDF pages, though the output may need cleaning:

from PyPDF2 import PdfReader  # PyPDF2 3.x; replaces the removed PdfFileReader

def extract_text_from_page(file_path, page_number):
    with open(file_path, 'rb') as file:
        pdf = PdfReader(file)

        # Get specific page (0-indexed); pages[] replaces getPage()
        page = pdf.pages[page_number]

        # Extract text; extract_text() replaces extractText()
        text = page.extract_text()
        print(f"Text from page {page_number + 1}:")
        print(text)

# Usage
# extract_text_from_page('document.pdf', 0)
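
What "cleaning" means depends on the document, but two common steps are rejoining words hyphenated across line breaks and collapsing stray whitespace. The sketch below uses only the standard library. The function name clean_extracted_text and the sample string are made up for illustration, and the hyphen rule is a heuristic that will also join genuinely hyphenated words that happen to break at a line end.

```python
import re


def clean_extracted_text(text):
    """Tidy raw extracted text: rejoin hyphenated line breaks and collapse whitespace."""
    # Rejoin words split across lines, e.g. "Intro-\nduction" -> "Introduction"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Strip each line and drop lines that end up empty
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


raw = "Intro-\nduction   to  PDFs\n\n\n  Second   line \n"
print(clean_extracted_text(raw))
# Introduction to PDFs
# Second line
```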

Rotating PDF Pages

You can rotate pages and save them to a new PDF file:

import PyPDF2

def rotate_pdf_page(input_path, output_path, page_number, rotation):
    # The angle is applied clockwise and must be a multiple of 90
    if rotation % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")

    # Open the PDF file
    with open(input_path, 'rb') as input_file:
        pdf_reader = PyPDF2.PdfReader(input_file)  # replaces the removed PdfFileReader

        # Get the page to rotate (0-indexed)
        page = pdf_reader.pages[page_number]

        # Rotate the page; rotate() replaces rotateClockwise()/rotateCounterClockwise()
        page.rotate(rotation)

        # Create a new PDF containing only the rotated page
        pdf_writer = PyPDF2.PdfWriter()
        pdf_writer.add_page(page)

        # Save the rotated page
        with open(output_path, 'wb') as output_file:
            pdf_writer.write(output_file)

    print(f"Page {page_number + 1} rotated {rotation} degrees and saved to {output_path}")

# Usage
# rotate_pdf_page('input.pdf', 'rotated_output.pdf', 0, 90)
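
The function above writes only the rotated page to the output file. If the goal is to keep the whole document and rotate a single page within it, one possible approach is sketched below, assuming the PyPDF2 3.x names PdfReader, PdfWriter, add_page(), and rotate(). The helper names normalize_rotation and rotate_keeping_all_pages are hypothetical.

```python
def normalize_rotation(degrees):
    """Map any multiple of 90 (including negatives) to 0, 90, 180, or 270 clockwise."""
    if degrees % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    return degrees % 360


def rotate_keeping_all_pages(input_path, output_path, page_number, degrees):
    """Rotate one page (0-indexed) and write every page of the input to the output."""
    # Imported here so normalize_rotation stays usable without PyPDF2 installed
    from PyPDF2 import PdfReader, PdfWriter  # PyPDF2 3.x names

    angle = normalize_rotation(degrees)
    reader = PdfReader(input_path)
    writer = PdfWriter()
    for index, page in enumerate(reader.pages):
        if index == page_number:
            page.rotate(angle)  # clockwise
        writer.add_page(page)
    with open(output_path, "wb") as output_file:
        writer.write(output_file)


print(normalize_rotation(-90))  # 270: a quarter turn counterclockwise
print(normalize_rotation(450))  # 90: full turns are dropped

# Usage
# rotate_keeping_all_pages('input.pdf', 'rotated_output.pdf', 0, -90)
```

Normalizing the angle first lets callers pass counterclockwise rotations as negative numbers without a separate branch for each case.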

Common PDF Operations

Operation          Method (PyPDF2 3.x)    Removed pre-3.0 name    Use Case
Extract Metadata   reader.metadata        getDocumentInfo()       Get document properties
Extract Text       page.extract_text()    extractText()           Get text content from pages
Rotate Pages       page.rotate()          rotateClockwise()       Change page orientation
Get Page Count     len(reader.pages)      getNumPages()           Count total pages

Limitations

PyPDF2 has some limitations to consider:

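  • Text Extraction: May not work well with complex layouts. Scanned PDFs contain only page images, so no text comes back without a separate OCR step
  • Image Extraction: Limited support for extracting images
  • Encrypted PDFs: Password-protected files must be decrypted with the correct password before their pages can be read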
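
For the encrypted case, a reader exposes an is_encrypted flag and a decrypt() method that returns a falsy value when the password does not match. The sketch below wraps those two members in a small helper. The names unlock and FakeReader are hypothetical, and FakeReader is only a stand-in so the example runs without a protected PDF on disk.

```python
def unlock(reader, password):
    """Decrypt a reader in place if it is encrypted, then return it."""
    if reader.is_encrypted:
        # decrypt() returns 0 (falsy) when the password does not match
        if not reader.decrypt(password):
            raise ValueError("wrong password for encrypted PDF")
    return reader


class FakeReader:
    """Stand-in exposing the same two members as a real reader (illustration only)."""

    def __init__(self, secret):
        self.is_encrypted = True
        self._secret = secret

    def decrypt(self, password):
        return 1 if password == self._secret else 0


reader = unlock(FakeReader("secret"), "secret")
print("Unlocked:", reader is not None)

# Usage with PyPDF2 3.x
# from PyPDF2 import PdfReader
# reader = unlock(PdfReader('protected.pdf'), 'secret')
```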

Conclusion

PyPDF2 is a powerful library for basic PDF operations like metadata extraction, text extraction, and page manipulation. While it has limitations with complex PDFs, it's excellent for simple PDF processing tasks in Python.

Updated on: 2026-03-25T05:41:13+05:30
