Extract hyperlinks from PDF in Python

Hyperlinks can be extracted from a PDF in Python using several libraries, such as PyPDF2, PDFMiner, and pdfx. Each takes a different approach and offers different capabilities for extracting URLs from PDF documents.

  • PyPDF2: A third-party Python library (installed with pip, not part of the standard library) that acts as a PDF toolkit, allowing us to read and manipulate PDF files.

  • PDFMiner: A tool used for extracting information from PDF documents, focusing on getting and analyzing text data.

  • pdfx: This module is specifically designed to extract metadata, plain text, and URLs from PDFs.

Using PyPDF2

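PyPDF2 can extract text and metadata, merge and split PDFs, and rotate pages. The approach here is to read the PDF file, extract its text, and then find URLs in that text with a regular expression. Note that this only finds URLs that are visible in the page text; a clickable link whose label does not show the URL is stored as a link annotation and is not matched by a text search.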

Install PyPDF2

To use the PyPDF2 library, install it using the following command:

pip install PyPDF2

Extract Hyperlinks with PyPDF2

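The following example demonstrates the regular-expression step. To keep it self-contained, it searches a hard-coded sample string that stands in for text extracted from a PDF; the file_path argument is not actually used: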

import PyPDF2
import re

# Create a sample text content for demonstration
def extract_urls_from_pdf(file_path):
    # Sample text content with URLs (simulating PDF extraction)
    sample_text = """
    Visit our website at https://www.tutorialspoint.com for tutorials.
    Also check http://www.example.com and https://github.com/python
    """
    
    # Regular expression to find URLs
    regex = r"(https?://\S+)"
    urls = re.findall(regex, sample_text)
    
    return urls

# Extract URLs
urls_found = extract_urls_from_pdf("sample.pdf")
print("URLs found:")
for url in urls_found:
    print(url)

Output:

URLs found:
https://www.tutorialspoint.com
http://www.example.com
https://github.com/python
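
The simulated example above never opens the file. A minimal sketch of the real flow might look like the following, assuming PyPDF2 2.x or later (where `PdfReader`, `reader.pages`, and `page.extract_text()` are available). The function names `find_urls` and `extract_urls_from_pdf`, and the trailing-punctuation cleanup, are illustrative choices and not part of PyPDF2.

```python
import re

# Match http/https URLs up to the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")


def find_urls(text):
    """Return unique URLs found in text, in order of first appearance."""
    urls = []
    for match in URL_PATTERN.findall(text):
        # \S+ also swallows sentence punctuation that follows a URL, so trim it.
        url = match.rstrip(".,;:!?)]}>\"'")
        if url and url not in urls:
            urls.append(url)
    return urls


def extract_urls_from_pdf(file_path):
    """Extract the text of every page with PyPDF2 and scan it for URLs."""
    # Imported here so find_urls() stays usable without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(file_path)
    # Fall back to an empty string for pages that yield no text.
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    return find_urls(text)


if __name__ == "__main__":
    sample = "See https://www.tutorialspoint.com, then http://www.example.com."
    print(find_urls(sample))
```

Calling `extract_urls_from_pdf("sample.pdf")` requires a real file at that path. Because the search runs on extracted text, a URL that wraps across two lines in the PDF may be cut off at the line break.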

Using pdfx

The pdfx module is designed specifically to extract URLs, metadata, and plain text from PDF files. This approach makes extracting URLs simpler compared to PyPDF2.

Install pdfx

Install it using the following command:

pip install pdfx

Example with pdfx

The following code uses a mock class that mimics the shape of the result returned by pdfx's get_references_as_dict() method; it does not read a real PDF:

# Simulating pdfx functionality for demonstration
class MockPDFx:
    def __init__(self, filename):
        self.filename = filename
    
    def get_references_as_dict(self):
        # Simulated URL extraction result
        return {'url': ['https://www.tutorialspoint.com', 
                       'http://www.example.com']}

# Simulate pdfx usage
pdf = MockPDFx("sample.pdf")
urls_dict = pdf.get_references_as_dict()
print("Extracted URLs:", urls_dict)

Output:

Extracted URLs: {'url': ['https://www.tutorialspoint.com', 'http://www.example.com']}
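
With the real library installed, the mock class can be swapped for pdfx itself. The sketch below assumes pdfx's documented `PDFx` class and its `get_references_as_dict()` method; the helper names `flatten_references` and `extract_references` are hypothetical and only illustrate one way to post-process the result.

```python
def flatten_references(references):
    """Collapse a {reference_type: [refs]} dict into a sorted list of unique refs."""
    unique = set()
    for refs in references.values():
        unique.update(refs)
    return sorted(unique)


def extract_references(file_path):
    """Return the references pdfx finds in a PDF, grouped by type."""
    # Imported lazily so flatten_references() can be used without pdfx installed.
    import pdfx

    pdf = pdfx.PDFx(file_path)
    return pdf.get_references_as_dict()


if __name__ == "__main__":
    # Same shape as the mock result above; real results depend on the PDF.
    sample = {"url": ["https://www.tutorialspoint.com", "http://www.example.com"]}
    print(flatten_references(sample))
```

Flattening is optional. It is useful when only a single deduplicated list is needed and the reference type does not matter.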

Using PDFMiner

Compared to PyPDF2, PDFMiner is a more powerful but more complex library. It parses a PDF into a tree of page and layout objects, which allows detailed extraction of text, hyperlinks, and document structure.

Install PDFMiner

To use the PDFMiner library, install it using:

pip install pdfminer.six

Example with PDFMiner

The following example uses a mock class with a hard-coded list of links to show the kind of result a PDFMiner-based extractor would return; it does not parse a real PDF:

# Simulating PDFMiner functionality for demonstration
class MockPDFMiner:
    def __init__(self, file_path):
        self.file_path = file_path
        # Simulated hyperlinks found in PDF
        self.hyperlinks = [
            'https://www.tutorialspoint.com',
            'http://www.education.gov.yk.ca/',
            'https://github.com/example'
        ]
    
    def extract_hyperlinks(self):
        return self.hyperlinks

# Simulate PDFMiner usage
pdf_miner = MockPDFMiner("sample.pdf")
hyperlinks = pdf_miner.extract_hyperlinks()

print("Found hyperlinks:")
for link in hyperlinks:
    print(f"Hyperlink: {link}")

Output:

Found hyperlinks:
Hyperlink: https://www.tutorialspoint.com
Hyperlink: http://www.education.gov.yk.ca/
Hyperlink: https://github.com/example
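
For a real PDF, one possible approach is to walk each page's link annotations and read the URI stored in the annotation's action dictionary. The sketch below assumes pdfminer.six, using `PDFParser`, `PDFDocument`, `PDFPage.create_pages()`, the `annots` attribute of a page, and `resolve1()` to follow indirect references. The function names and the pluggable `resolve` parameter are hypothetical, added so the core loop can be tried with plain dictionaries.

```python
def uris_from_annotations(annotations, resolve=lambda obj: obj):
    """Collect URI targets from an iterable of annotation dictionaries.

    resolve turns indirect references into real objects. The default is a
    no-op so the function can be exercised with plain dicts.
    """
    uris = []
    for ref in annotations or []:
        annot = resolve(ref)
        if not isinstance(annot, dict):
            continue
        # The "A" entry holds the action triggered when the link is clicked.
        action = resolve(annot.get("A"))
        if isinstance(action, dict) and "URI" in action:
            uri = resolve(action["URI"])
            # pdfminer returns PDF strings as bytes, so decode them.
            if isinstance(uri, bytes):
                uri = uri.decode("utf-8", "replace")
            uris.append(str(uri))
    return uris


def extract_hyperlinks(file_path):
    """Return the URI of every link annotation in the PDF at file_path."""
    # Imported lazily so uris_from_annotations() works without pdfminer.six.
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdftypes import resolve1

    links = []
    with open(file_path, "rb") as fh:
        document = PDFDocument(PDFParser(fh))
        for page in PDFPage.create_pages(document):
            links.extend(uris_from_annotations(resolve1(page.annots), resolve1))
    return links
```

Unlike a regex over extracted text, this reads the link targets themselves, so it can find links whose visible label is not a URL. It will not find URLs that appear only as plain text without a clickable annotation.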

Comparison

Library     Ease of Use   Accuracy    Best For
PyPDF2      Medium        Good        Simple text extraction with regex
pdfx        High          Very Good   Direct URL extraction
PDFMiner    Low           Excellent   Complex PDF parsing and analysis

Conclusion

For simple URL extraction, use pdfx for its simplicity and direct functionality. Use PDFMiner for complex PDF structures requiring detailed analysis. PyPDF2 works well when combined with regex for basic hyperlink extraction from text content.

Updated on: 2026-03-25T19:27:04+05:30
