Downloading PDFs with Python using Requests and BeautifulSoup

Requests and BeautifulSoup are Python libraries that can be used together to download files such as PDFs from the web. The Requests library is used to send HTTP requests and receive responses. The BeautifulSoup library is used to parse the HTML received in the response and extract the link to the downloadable PDF. In this article, we will understand how we can download PDFs using Requests and Beautiful Soup in Python.

Installing Dependencies

Before using the BeautifulSoup and Requests libraries in Python, we need to install them on our system using the pip command. To install both libraries, run the following commands in your terminal.

pip install requests
pip install beautifulsoup4
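
To confirm that both packages were installed into the interpreter you are using, a quick check like the following can help. This is only a minimal sketch: it imports the two libraries and prints their version strings, and the versions printed will depend on your environment.

```python
# Import both libraries; an ImportError here means the install did not
# reach the Python interpreter that is running this script.
import bs4
import requests

# Both packages expose their version as a string attribute.
print("requests", requests.__version__)
print("beautifulsoup4", bs4.__version__)
```

Note that the package is installed as beautifulsoup4 but imported as bs4.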

Downloading PDFs using Requests and Beautiful Soup

To download a PDF from the internet, first use the Requests library to fetch the web page that links to the PDF file. We can then use Beautiful Soup to parse the HTML response and extract the link to the PDF file. Because that link is often relative, it is combined with the URL of the page to get the full URL of the PDF file. Finally, we send a GET request to that URL and save the response body to a file.
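
Combining the page URL with the extracted link is the step that most often goes wrong, since plain string concatenation only works for some link forms. The sketch below shows one way to do it with urljoin from the standard library. The page URL, the link values, and the helper name resolve_link are hypothetical and chosen only for illustration.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page that contains the PDF link.
page_url = "https://example.com/reports/index.html"


def resolve_link(page_url, href):
    """Return the absolute URL for an href found on page_url."""
    return urljoin(page_url, href)


# A link relative to the page's directory.
print(resolve_link(page_url, "files/a.pdf"))
# A link relative to the site root.
print(resolve_link(page_url, "/files/a.pdf"))
# A link that is already absolute is returned unchanged.
print(resolve_link(page_url, "https://cdn.example.com/a.pdf"))
```

urljoin handles all three cases, which is why it is usually a safer choice than adding the two strings together.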

Example

In the code below, replace 'https://example.com/' with the URL of the web page that contains the link to the PDF file. The code downloads the first link on that page whose address ends in .pdf.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Step 1: Fetch the page that links to the PDF
url = 'https://example.com/'
response = requests.get(url, timeout=30)

if response.status_code == 200:
   # Step 2: Parse the HTML and find the first link ending in .pdf
   soup = BeautifulSoup(response.text, 'html.parser')
   link = soup.find('a', href=lambda h: h and h.lower().endswith('.pdf'))

   if link is None:
      print('Error: no PDF link found on the page.')
   else:
      # Step 3: Resolve the link against the page URL and download the PDF
      pdf_url = urljoin(url, link['href'])
      pdf_response = requests.get(pdf_url, timeout=30)

      if pdf_response.status_code == 200:
         # Write the raw bytes of the response to a local file
         with open('document.pdf', 'wb') as f:
            f.write(pdf_response.content)
         print('PDF downloaded successfully.')
      else:
         print('Error:', pdf_response.status_code)
else:
   print('Error:', response.status_code)

Output

PDF downloaded successfully.
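
A page may contain several PDF links, and a PDF can be large enough that loading it fully into memory is undesirable. The following is a rough sketch of one way to handle both cases, not the only approach. The function names find_pdf_links and download_pdf, the sample HTML, and the URLs are all hypothetical. The demo at the bottom parses an inline HTML string, so it runs without network access.

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def find_pdf_links(html, base_url):
    """Return absolute URLs of all links in html whose path ends in .pdf."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for tag in soup.find_all("a", href=True):
        full_url = urljoin(base_url, tag["href"])
        # Check only the path so that query strings such as ?v=2 are ignored.
        if urlparse(full_url).path.lower().endswith(".pdf"):
            links.append(full_url)
    return links


def download_pdf(pdf_url, path, chunk_size=8192):
    """Stream pdf_url to path in chunks instead of holding it in memory."""
    with requests.get(pdf_url, stream=True, timeout=30) as response:
        response.raise_for_status()  # raise an exception on 4xx/5xx responses
        with open(path, "wb") as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                f.write(chunk)


# Demo on a hypothetical page; no request is sent here.
sample_html = """
<a href="/about">About</a>
<a href="docs/guide.pdf">Guide</a>
<a href="https://cdn.example.com/Report.PDF?v=2">Report</a>
"""
links = find_pdf_links(sample_html, "https://example.com/library/")
print(links)
```

To download the files, each returned URL could be passed to download_pdf together with a local file name, for example download_pdf(links[0], 'guide.pdf').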

Conclusion

In this article, we discussed how we can download PDF files from the internet using the Requests and Beautiful Soup libraries in Python. With Requests, we send an HTTP request to fetch the page that contains the PDF link. We then use Beautiful Soup to parse the page and extract the link, and a second request downloads the PDF file itself.

Rohan Singh

Updated on: 10-Jul-2023
