Extract hyperlinks from PDF in Python


Python has a large set of libraries for handling different kinds of tasks. To extract data and meta-information from a PDF, we can use the PyPDF2 package. It is easy to use and supports operations such as extracting text from a PDF, searching for a keyword in the document, and extracting information such as hyperlinks and URLs. Using the PyPDF2 package, we will extract the hyperlinks from a PDF document.

We will follow these steps to extract the hyperlinks from a PDF:

  • Install PyPDF2 on the local machine by running pip install PyPDF2 in the command shell.

  • Import PyPDF2.

  • Open the PDF file in binary mode and pass it to PdfFileReader.

  • Define a function that extracts the links from the text of a particular page.

  • Iterate over all the pages and extract the text of each one using the extractText() function.

  • Hyperlinks are found in the extracted text by pattern matching, so import re to work with regular expressions.

  • Find every substring that starts with http:// or https:// using re.findall(regex, string).

  • If any URLs are found, return them and print them on the screen.
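The pattern-matching step can be tried on its own before touching a PDF. The following is a minimal sketch; the sample string and the name sample_text are made up for illustration and stand in for the text of one page.

```python
import re

# Hypothetical sample text standing in for the text extracted from one PDF page
sample_text = "Docs at https://example.com/docs and http://example.org/a?b=1 for details."

# https? matches "http" or "https"; \S+ consumes everything up to the next whitespace
regex = r"(https?://\S+)"

# With a single group in the pattern, findall returns a list of the matched strings
urls = re.findall(regex, sample_text)
print(urls)  # ['https://example.com/docs', 'http://example.org/a?b=1']
```

Because \S+ stops only at whitespace, any punctuation that directly follows a URL in the text becomes part of the match.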

Example

# Import the necessary packages
import re
import PyPDF2

def find_urls(text):
   # Return every substring that matches the pattern, not just the first one
   regex = r"(https?://\S+)"
   return re.findall(regex, text)

# Open the file in binary mode; the with block closes it automatically
# Note: PdfFileReader, numPages, getPage and extractText are the names
# used by PyPDF2 releases before 3.0
with open("newfile.pdf", 'rb') as file:
   readPDF = PyPDF2.PdfFileReader(file)
   # Iterate over all the pages of the file
   for page_no in range(readPDF.numPages):
      page = readPDF.getPage(page_no)
      # Extract the text from the page
      text = page.extractText()
      # Print every URL found on this page
      for url in find_urls(text):
         print(url)
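The camel-case names used above (PdfFileReader, numPages, getPage, extractText) no longer work in PyPDF2 3.0 and later, where the equivalents are PdfReader, reader.pages and page.extract_text(). A rough sketch of the same idea with the newer names might look like the following. The helper names urls_by_page and print_pdf_urls are hypothetical, chosen here for illustration, and the PyPDF2 import is placed inside the function so the regex helper can be tried without the package installed.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")

def urls_by_page(reader):
    """Yield (page_number, urls) for each page of a PdfReader-like object."""
    for number, page in enumerate(reader.pages, start=1):
        # Fall back to an empty string if a page yields no text
        text = page.extract_text() or ""
        yield number, URL_PATTERN.findall(text)

def print_pdf_urls(path):
    """Print every URL found in the text of the PDF at the given path."""
    # Assumes PyPDF2 3.x is installed; imported lazily on purpose
    from PyPDF2 import PdfReader
    reader = PdfReader(path)
    for number, urls in urls_by_page(reader):
        for url in urls:
            print(f"page {number}: {url}")
```

Calling print_pdf_urls("newfile.pdf") would then print one line per URL, prefixed with the page it was found on.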

Output

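Running the above code prints the URLs found in the text of each page of the given PDF document. Links attached to text where the URL itself is not visible are not matched, because only the extracted page text is searched.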
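Since the pattern https?://\S+ stops only at whitespace, URLs that appear at the end of a sentence or inside brackets can come back with trailing punctuation, and the same URL may be reported several times. One possible clean-up step is sketched below; the function name clean_urls and the set of stripped characters are assumptions for illustration, not part of PyPDF2.

```python
import re

def clean_urls(text):
    """Return unique URLs from text, in order of appearance, without trailing punctuation."""
    seen = []
    for raw in re.findall(r"https?://\S+", text):
        # Drop punctuation that most likely belongs to the sentence, not the URL
        url = raw.rstrip(".,;:!?)]}>\"'")
        if url and url not in seen:
            seen.append(url)
    return seen

sample = "See (https://example.com/a). Also https://example.com/a, and http://example.org!"
print(clean_urls(sample))  # ['https://example.com/a', 'http://example.org']
```

The trade-off is that a URL which legitimately ends in one of the stripped characters, such as a closing parenthesis, would be shortened, so the character set may need adjusting for a given document.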

raja
Published on 21-Apr-2021 07:38:46