Working with PDF files in Python?
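Python has several libraries for working with PDF files. PyPDF2 is a popular pure-Python library that can split, merge, crop, and transform PDF pages. It can also extract text and metadata, and encrypt PDF files with a password. Note that PyPDF2 3.0 removed the older camelCase names such as PdfFileReader, getPage(), and extractText() in favor of PdfReader, reader.pages, and extract_text(), and development has since continued under the package name pypdf.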
Installation
Install PyPDF2 using pip:
pip install PyPDF2
Verify the installation:
import PyPDF2
print("PyPDF2 imported successfully!")
PyPDF2 imported successfully!
Extracting PDF Metadata
You can extract information such as the author, title, subject, and page count from a PDF file:
def extract_pdf_metadata():
    # Print placeholder values to show the shape of the output;
    # no PDF file is read in this demonstration
    print("PDF Metadata Example:")
    print("Author: Sample Author")
    print("Creator: Sample Creator")
    print("Producer: Sample Producer")
    print("Subject: Sample Subject")
    print("Title: Sample PDF Document")
    print("Number of Pages: 10")

extract_pdf_metadata()

PDF Metadata Example:
Author: Sample Author
Creator: Sample Creator
Producer: Sample Producer
Subject: Sample Subject
Title: Sample PDF Document
Number of Pages: 10
Complete Metadata Extraction Function
from PyPDF2 import PdfReader  # PdfFileReader was removed in PyPDF2 3.0

def extract_pdf_metadata(file_path):
    with open(file_path, 'rb') as file:
        pdf = PdfReader(file)
        # metadata may be None if the PDF has no document information
        info = pdf.metadata
        number_of_pages = len(pdf.pages)

        if info is not None:
            print("Author:", info.author)
            print("Creator:", info.creator)
            print("Producer:", info.producer)
            print("Subject:", info.subject)
            print("Title:", info.title)
        print("Number of Pages:", number_of_pages)

# Usage
# extract_pdf_metadata('document.pdf')
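A script often needs the metadata as data rather than as printed lines. The sketch below is not part of PyPDF2: the helper name `metadata_to_dict` and the `FIELDS` tuple are hypothetical, and it assumes only that the metadata object exposes the five attributes printed in the example above and that it may be None when a PDF carries no document information.

```python
from types import SimpleNamespace

# Field names taken from the attributes printed in the example above.
FIELDS = ("author", "creator", "producer", "subject", "title")

def metadata_to_dict(info):
    """Return the five metadata fields as a dict, using None for anything missing."""
    if info is None:
        return {name: None for name in FIELDS}
    return {name: getattr(info, name, None) for name in FIELDS}

# Stand-in object for demonstration; with PyPDF2 you would pass the
# document-information object obtained from the reader instead.
sample = SimpleNamespace(author="Sample Author", title="Sample PDF Document")
print(metadata_to_dict(sample))
```

Returning a dictionary with explicit None values keeps downstream code from having to check whether each attribute exists.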
Extracting Text from PDFs
PyPDF2 can extract text from PDF pages, though the output may need cleaning:
from PyPDF2 import PdfReader  # PdfFileReader was removed in PyPDF2 3.0

def extract_text_from_page(file_path, page_number):
    with open(file_path, 'rb') as file:
        pdf = PdfReader(file)
        # Get specific page (0-indexed)
        page = pdf.pages[page_number]
        # Extract text
        text = page.extract_text()
        print(f"Text from page {page_number + 1}:")
        print(text)

# Usage
# extract_text_from_page('document.pdf', 0)
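As one illustration of the cleaning mentioned above, the following sketch tidies a raw extracted string using only the standard library. The function name `clean_extracted_text` and the specific rules (rejoin words hyphenated across a line break, collapse repeated spaces, drop empty lines) are assumptions chosen for this example; real documents may need different rules.

```python
import re

def clean_extracted_text(text):
    """Tidy raw extracted text: rejoin hyphenated words and normalize whitespace."""
    # Rejoin words split across a line break, e.g. "Py-\nthon" becomes "Python".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Strip each line and drop lines that are empty.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)

raw = "Working  with PDF\t files in Py-\nthon\n\n\n  is simple.  "
print(clean_extracted_text(raw))
```

The hyphen rule is deliberately simple: it would also merge a genuinely hyphenated compound that happens to break at the end of a line, so treat it as a starting point rather than a general solution.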
Rotating PDF Pages
You can rotate pages and save them to a new PDF file:
import PyPDF2

def rotate_pdf_page(input_path, output_path, page_number, rotation):
    # Reject angles the function does not handle instead of silently ignoring them
    if rotation not in (90, 180, 270):
        raise ValueError("rotation must be 90, 180, or 270")

    # Keep the input file open until the output has been written,
    # since page content is read from it on demand
    with open(input_path, 'rb') as input_file:
        pdf_reader = PyPDF2.PdfReader(input_file)
        # Get the page to rotate (0-indexed)
        page = pdf_reader.pages[page_number]
        # Rotate the page clockwise by the given angle
        page.rotate(rotation)
        # Create a new PDF containing only the rotated page
        pdf_writer = PyPDF2.PdfWriter()
        pdf_writer.add_page(page)
        # Save the rotated page
        with open(output_path, 'wb') as output_file:
            pdf_writer.write(output_file)

    print(f"Page {page_number} rotated {rotation} degrees and saved to {output_path}")

# Usage
# rotate_pdf_page('input.pdf', 'rotated_output.pdf', 0, 90)
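The rotation function above accepts only 90, 180, or 270 degrees. If callers may pass values such as -90 or 450, a small helper can first reduce any multiple of 90 to an equivalent clockwise angle. The name `normalize_rotation` is hypothetical and the helper is independent of PyPDF2; this is a minimal sketch of one way to do it.

```python
def normalize_rotation(angle):
    """Map any multiple of 90 degrees to an equivalent clockwise angle (0, 90, 180, or 270)."""
    if angle % 90 != 0:
        raise ValueError(f"Rotation must be a multiple of 90 degrees, got {angle}")
    # Python's % always returns a non-negative result for a positive modulus,
    # so a counterclockwise -90 becomes a clockwise 270.
    return angle % 360

print(normalize_rotation(-90))
print(normalize_rotation(450))
```

A result of 0 means no rotation is needed, so the caller can skip rotating and copy the page unchanged.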
Common PDF Operations
| Operation | Method | Use Case |
|---|---|---|
| Extract Metadata | reader.metadata (formerly getDocumentInfo()) | Get document properties |
| Extract Text | page.extract_text() (formerly extractText()) | Get text content from pages |
| Rotate Pages | page.rotate() (formerly rotateClockwise()) | Change page orientation |
| Get Page Count | len(reader.pages) (formerly getNumPages()) | Count total pages |
Limitations
PyPDF2 has some limitations to consider:
- Text Extraction: May not work well with complex layouts, and returns no text for scanned PDFs that contain only page images
- Image Extraction: Limited support for extracting images
- Encrypted PDFs: Password-protected files must be decrypted with the correct password before their pages can be read
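For the encrypted-PDF case, the sketch below shows one way the extra handling might look. It assumes the `is_encrypted` property and `decrypt()` method of the newer PdfReader class, and that `decrypt()` returns a falsy value when the password does not match. The helper name `unlock_reader`, the file name, and the password are placeholders.

```python
def unlock_reader(reader, password):
    """Decrypt the reader in place if needed; raise ValueError on a wrong password."""
    if not reader.is_encrypted:
        return reader
    # Assumption: decrypt() returns a falsy value (0) when the password is wrong.
    if not reader.decrypt(password):
        raise ValueError("Incorrect password for encrypted PDF")
    return reader

# Usage with PyPDF2 (file name and password are placeholders):
# from PyPDF2 import PdfReader
# reader = unlock_reader(PdfReader("protected.pdf"), "secret")
# print(len(reader.pages))
```

Raising an exception on a wrong password surfaces the problem immediately, rather than letting a later page access fail with a less obvious error.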
Conclusion
PyPDF2 handles basic PDF operations such as metadata extraction, text extraction, and page rotation. It struggles with complex layouts and scanned documents, but it works well for simple PDF processing tasks in Python.
