Extract hyperlinks from PDF in Python
Python has a large set of libraries for handling different types of operations. To extract data and meta-information from a PDF, we can use the PyPDF2 package. It is easy to use and supports operations such as extracting text from a PDF, searching for a keyword in a document, and extracting meta-information such as hyperlinks and URLs. In this article, we will use PyPDF2 to extract the hyperlinks from a PDF document.
We will follow these steps to extract the hyperlinks from a PDF:
Install PyPDF2 on the local machine by running pip install PyPDF2 in the command shell.
Import PyPDF2.
Open the PDF file in binary mode ('rb') and create a PdfFileReader object from it.
Define a function that extracts the links from the text of a particular page.
Iterate over all the pages and extract the text of each page using the extractText() function.
To find the hyperlinks in the extracted text, we use pattern matching with regular expressions, so import the re module as well.
Find every substring that starts with http:// or https:// using re.findall(regex, string).
If any URLs are found, return them and print them on the screen.
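The pattern-matching step can be tried on its own before a PDF is involved. The following is a minimal sketch that applies the same regular expression to an ordinary string; the name find_urls and the sample text are assumptions made for illustration, not part of PyPDF2.

```python
import re

# Same pattern as in the article: "http://" or "https://" followed by
# any run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def find_urls(text):
    """Return every URL-like substring found in text, in order."""
    return URL_PATTERN.findall(text)


# Hypothetical sample text standing in for the text of one PDF page.
sample = "Docs at https://example.com/guide and mirror at http://example.org"
print(find_urls(sample))
```

Because findall() returns a list, a page with no links gives an empty list rather than None, which keeps the calling loop simple.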
Example
    # Import the necessary packages
    import re
    import PyPDF2

    # Note: PdfFileReader, numPages, getPage and extractText are the legacy
    # PyPDF2 names. PyPDF2 3.0 and later use PdfReader, len(reader.pages),
    # reader.pages[i] and extract_text() instead.

    def find_url(string):
        # Find all the substrings that match the URL pattern
        regex = r"(https?://\S+)"
        return re.findall(regex, string)

    # Open the file in binary mode
    file = open("newfile.pdf", 'rb')
    readPDF = PyPDF2.PdfFileReader(file)

    # Iterate over all the pages of the file
    for page_no in range(readPDF.numPages):
        page = readPDF.getPage(page_no)
        # Extract the text from the page
        text = page.extractText()
        # Print every URL found on the page
        for url in find_url(text):
            print(url)

    # Close the file
    file.close()
Output
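Running the above code will print the hyperlinks found in the text of the given PDF document. Because the search runs on the extracted text, only URLs that are written out in the page text are found.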
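Since \S+ matches any non-whitespace character, a URL at the end of a sentence is captured together with its trailing punctuation, and the same URL may be reported on several pages. The following is one possible sketch of a clean-up step, assuming the text of each page has already been extracted into a list of strings; collect_urls and the sample pages are hypothetical names and data used only for illustration.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")


def collect_urls(page_texts):
    """Return (page_number, url) pairs, skipping URLs already seen."""
    seen = set()
    results = []
    for page_no, text in enumerate(page_texts, start=1):
        for match in URL_PATTERN.findall(text):
            # \S+ also swallows punctuation that ends a sentence or a
            # parenthesis, so trim the most common trailing characters.
            url = match.rstrip(".,;:)]>\"'")
            if url not in seen:
                seen.add(url)
                results.append((page_no, url))
    return results


# Hypothetical page texts standing in for the extracted text of two pages.
pages = [
    "Visit https://example.com/docs.",
    "See (https://example.com/docs) or http://example.org/faq, thanks.",
]
for page_no, url in collect_urls(pages):
    print(page_no, url)
```

Trimming is a trade-off: a URL that legitimately ends in one of the stripped characters, such as a closing parenthesis, would be shortened as well.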
