Python Web Scraping - Modules for Web Scraping

In this chapter, let us learn about the various Python modules that we can use for web scraping.

Python Development Environments using virtualenv

Virtualenv is a tool to create isolated Python environments. With the help of virtualenv, we can create a folder that contains all the executables needed to use the packages that our Python project requires. It also allows us to add and modify Python modules without touching, or needing access to, the global installation.

You can use the following command to install virtualenv −

(myenv) D:\Projects\python\myenv>pip3 install virtualenv
Collecting virtualenv
  Downloading virtualenv-20.36.1-py3-none-any.whl.metadata (4.7 kB)
Collecting distlib<1,>=0.3.7 (from virtualenv)
  Downloading distlib-0.4.0-py2.py3-none-any.whl.metadata (5.2 kB)
Collecting filelock<4,>=3.20.1 (from virtualenv)
  Downloading filelock-3.20.3-py3-none-any.whl.metadata (2.1 kB)
Requirement already satisfied: platformdirs<5,>=3.9.1 in .\Lib\site-packages (from virtualenv) (4.5.1)
Downloading virtualenv-20.36.1-py3-none-any.whl (6.0 MB)
Downloading distlib-0.4.0-py2.py3-none-any.whl (469 kB)
Downloading filelock-3.20.3-py3-none-any.whl (16 kB)
Installing collected packages: distlib, filelock, virtualenv
Successfully installed distlib-0.4.0 filelock-3.20.3 virtualenv-20.36.1

Next, create a directory for the project with the following command −

(myenv) D:\Projects\python\myenv>mkdir webscrap

Now, enter that directory with the following command −

(myenv) D:\Projects\python\myenv>cd webscrap

Now, initialize a virtual environment in a folder of our choice (here, websc) as follows −

(myenv) D:\Projects\python\myenv\webscrap>virtualenv websc
created virtual environment CPython3.14.2.final.0-64 in 981ms
  creator CPython3Windows(dest=D:\Projects\python\myenv\webscrap\websc, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, via=copy, app_data_dir=C:\Users\mahes\AppData\Local\pypa\virtualenv)
    added seed packages: pip==25.3
  activators BashActivator,BatchActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator

Now, activate the virtual environment with the command given below. Once it is activated, its name appears in parentheses at the left of the prompt.

(myenv) D:\Projects\python\myenv\webscrap>websc\scripts\activate

We can install any module in this environment as follows −

(websc) D:\Projects\python\myenv\webscrap>pip3 install requests
Collecting requests
  Using cached requests-2.32.5-py3-none-any.whl.metadata (4.9 kB)
Collecting charset_normalizer<4,>=2 (from requests)
  Using cached charset_normalizer-3.4.4-cp314-cp314-win_amd64.whl.metadata (38 kB)
Collecting idna<4,>=2.5 (from requests)
  Using cached idna-3.11-py3-none-any.whl.metadata (8.4 kB)
Collecting urllib3<3,>=1.21.1 (from requests)
  Downloading urllib3-2.6.3-py3-none-any.whl.metadata (6.9 kB)
Collecting certifi>=2017.4.17 (from requests)
  Downloading certifi-2026.1.4-py3-none-any.whl.metadata (2.5 kB)
Using cached requests-2.32.5-py3-none-any.whl (64 kB)
Using cached charset_normalizer-3.4.4-cp314-cp314-win_amd64.whl (107 kB)
Using cached idna-3.11-py3-none-any.whl (71 kB)
Downloading urllib3-2.6.3-py3-none-any.whl (131 kB)
Downloading certifi-2026.1.4-py3-none-any.whl (152 kB)
Installing collected packages: urllib3, idna, charset_normalizer, certifi, requests
Successfully installed certifi-2026.1.4 charset_normalizer-3.4.4 idna-3.11 requests-2.32.5 urllib3-2.6.3

For deactivating the virtual environment, we can use the following command −

(websc) D:\Projects\python\myenv\webscrap>deactivate
D:\Projects\python\myenv\webscrap>

Notice that the (websc) prefix has disappeared from the prompt, which shows that the environment is no longer active.
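A script can also check for itself whether it is running inside a virtual environment. The following is a minimal sketch, assuming virtualenv 20 or later (or the standard venv module), where sys.prefix points at the environment folder while sys.base_prefix still points at the base installation. The helper name in_virtual_environment is made up for illustration.

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Legacy virtualenv releases (before 20) set sys.real_prefix instead.
    if hasattr(sys, "real_prefix"):
        return True
    # venv and virtualenv 20+ point sys.prefix at the environment folder,
    # while sys.base_prefix keeps pointing at the base installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("interpreter :", sys.executable)
    print("environment :", sys.prefix)
    print("virtual env :", in_virtual_environment())
```

Running this once with the environment activated and once after deactivate should show the last line change, along with the interpreter path.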

Python Modules for Web Scraping

Web scraping is the process of constructing an agent which can extract, parse, download and organize useful information from the web automatically. In other words, instead of manually saving the data from websites, the web scraping software will automatically load and extract data from multiple websites as per our requirement.

In this section, we are going to discuss useful Python libraries for web scraping.

Requests

Requests is a simple and efficient Python HTTP library for accessing web pages. With its help, we can get the raw HTML of a web page, which can then be parsed to retrieve the data. Before using Requests, let us look at its installation.

Installing Requests

We can install it either in our virtual environment or in the global installation. With the help of the pip command, we can easily install it as follows −

(myenv) D:\Projects\python\myenv>pip3 install requests
Requirement already satisfied: requests in .\Lib\site-packages (2.32.5)
Requirement already satisfied: charset_normalizer<4,>=2 in .\Lib\site-packages (from requests) (3.4.4)
Requirement already satisfied: idna<4,>=2.5 in .\Lib\site-packages (from requests) (3.11)
Requirement already satisfied: urllib3<3,>=1.21.1 in .\Lib\site-packages (from requests) (2.6.2)
Requirement already satisfied: certifi>=2017.4.17 in .\Lib\site-packages (from requests) (2025.11.12)

Example

In this example, we make an HTTP GET request for a web page. For this, we first need to import the requests library as follows −

(websc) D:\Projects\python\myenv\webscrap>py
Python 3.14.2 (tags/v3.14.2:df79316, Dec  5 2025, 17:18:21) [MSC v.1944 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests

In the following line of code, we use requests to make an HTTP GET request for the URL https://authoraditiagarwal.com/ −

>>> r = requests.get('https://authoraditiagarwal.com/')

Now we can retrieve the content by using the .text property. Here we slice it so that only the first 200 characters are shown −

>>> r.text[:200]
'<!DOCTYPE html><html lang="en-US" id="html"><head><meta charset="UTF-8" /><meta http-equiv="X-UA-Compatible" content="IE=10" /><link rel="profile" href="http://gmpg.org/xfn/11" /><link rel="pingback" '

Observe that the output above contains only the first 200 characters of the page, because we sliced r.text with [:200].
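In a script, as opposed to an interactive session, it usually helps to set a timeout and to check the response status before reading the body. The sketch below is one way to wrap the steps above. The helper name fetch_preview and its get parameter are assumptions made for illustration; get exists only so that a stand-in for requests.get can be passed in when no network is available.

```python
def fetch_preview(url, get=None, length=200, timeout=10):
    """Fetch url and return (status code, first `length` characters of the body)."""
    if get is None:
        # Imported lazily so the helper can also be tried with a stand-in
        # for requests.get (hypothetical `get` parameter, not part of requests).
        import requests
        get = requests.get
    response = get(url, timeout=timeout)  # give up instead of waiting forever
    response.raise_for_status()           # raise an exception on 4xx/5xx replies
    return response.status_code, response.text[:length]
```

Calling fetch_preview('https://authoraditiagarwal.com/') at the prompt should return the status code together with the same 200-character preview that r.text[:200] produced earlier.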

Urllib3

Urllib3 is another Python library that can be used for retrieving data from URLs, similar to the requests library. You can read more about it in its documentation at https://urllib3.readthedocs.io/en/latest/.

Installing Urllib3

Using the pip command, we can install urllib3 either in our virtual environment or in the global installation. The command below also installs bs4 (BeautifulSoup), which the next example uses for parsing −

(websc) D:\Projects\python\myenv\webscrap>pip3 install urllib3 bs4
Requirement already satisfied: urllib3 in d:\projects\python\myenv\webscrap\websc\lib\site-packages (2.6.3)
Collecting bs4
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Collecting beautifulsoup4 (from bs4)
  Using cached beautifulsoup4-4.14.3-py3-none-any.whl.metadata (3.8 kB)
Collecting soupsieve>=1.6.1 (from beautifulsoup4->bs4)
  Downloading soupsieve-2.8.3-py3-none-any.whl.metadata (4.6 kB)
Collecting typing-extensions>=4.0.0 (from beautifulsoup4->bs4)
  Using cached typing_extensions-4.15.0-py3-none-any.whl.metadata (3.3 kB)
Downloading bs4-0.0.2-py2.py3-none-any.whl (1.2 kB)
Using cached beautifulsoup4-4.14.3-py3-none-any.whl (107 kB)
Downloading soupsieve-2.8.3-py3-none-any.whl (37 kB)
Using cached typing_extensions-4.15.0-py3-none-any.whl (44 kB)
Installing collected packages: typing-extensions, soupsieve, beautifulsoup4, bs4
Successfully installed beautifulsoup4-4.14.3 bs4-0.0.2 soupsieve-2.8.3 typing-extensions-4.15.0

Example: Scraping using Urllib3 and BeautifulSoup

In the following example, we scrape a web page by using Urllib3 and BeautifulSoup. We use Urllib3 in place of the requests library to get the raw data (HTML) from the web page, and then use BeautifulSoup to parse that HTML.

main.py

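import urllib3
from bs4 import BeautifulSoup

# A PoolManager handles connection pooling and reuse for us
http = urllib3.PoolManager()
# Fetch the page; r.data holds the raw response body as bytes
r = http.request('GET', 'https://authoraditiagarwal.com')
# Parse the HTML with Python's built-in parser
soup = BeautifulSoup(r.data, 'html.parser')
print(soup.title)       # the whole <title> element
print(soup.title.text)  # only the text inside it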

This is the output you will observe when you run this code −

<title>Learn and Grow with Aditi Agarwal</title>
Learn and Grow with Aditi Agarwal
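To see what the parsing step involves without any network access, here is a rough sketch that pulls the title out of a hard-coded HTML string using html.parser from the standard library, the parser that the 'html.parser' argument above selects as BeautifulSoup's backend. The TitleParser class and the SAMPLE_HTML string are made up for illustration. BeautifulSoup does far more than this (it builds a searchable tree and copes with broken markup), so treat this only as a peek at the underlying idea.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is recorded; later ones are ignored.
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than assign.
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


# Hypothetical sample markup standing in for a downloaded page
SAMPLE_HTML = "<html><head><title>Learn and Grow</title></head><body><p>Hi</p></body></html>"

parser = TitleParser()
parser.feed(SAMPLE_HTML)
print(parser.title)  # prints: Learn and Grow
```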

Selenium

Selenium is an open-source automated testing suite for web applications that works across different browsers and platforms. It is not a single tool but a suite of software, with bindings for Python, Java, C#, Ruby and JavaScript. Here we are going to perform web scraping by using Selenium and its Python bindings. You can learn more about Selenium with Java at the link Selenium.

The Selenium Python bindings provide a convenient API to access Selenium WebDrivers such as Firefox, IE, Chrome and Remote. Selenium 4 supports only Python 3; check the selenium page on PyPI for the minimum Python version required by the release you install.

Installing Selenium

Using the pip command, we can install selenium either in our virtual environment or in the global installation.

pip install selenium

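Selenium requires a driver to interface with the chosen browser, so we may need to download one. (Selenium 4.6 and later ship with Selenium Manager, which can usually locate or download a matching driver automatically.) The following table lists browsers and the links given for downloading their drivers.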

Chrome	https://www.google.com/intl/en_in/chrome/
Edge	https://developer.microsoft.com/
Firefox	https://github.com/
Safari	https://webkit.org/

Example

This example shows web scraping using Selenium. Selenium can also be used for testing, which is known as Selenium testing.

After downloading the driver that matches our browser version, we can write the Python script.

First, we need to import webdriver from selenium as follows −

from selenium import webdriver

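Now, provide the path of the web driver that we downloaded. In Selenium 4, the path is passed through a Service object, because recent releases no longer accept the older executable_path argument of webdriver.Chrome −

from selenium.webdriver.chrome.service import Service
path = r'C:\Users\gaurav\Desktop\Chromedriver'  # raw string, so single backslashes
browser = webdriver.Chrome(service=Service(executable_path=path))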

Now, provide the URL that we want to open in the web browser, which is now controlled by our Python script.

browser.get('https://authoraditiagarwal.com/leadershipmanagement')

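We can also locate a particular element by providing its XPath, much as we would in lxml. Recent Selenium 4 releases use find_element with a By locator in place of the removed find_element_by_xpath method −

from selenium.webdriver.common.by import By
browser.find_element(By.XPATH, '/html/body').click()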

You can check the browser, controlled by Python script, for output.

Scrapy

Scrapy is a fast, open-source web crawling framework written in Python, used to extract data from web pages with the help of selectors based on XPath or CSS. Scrapy was first released on June 26, 2008 under the BSD license, and reached its 1.0 milestone in June 2015. It provides all the tools we need to extract, process and structure data from websites.
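To give a feel for what XPath-based selection looks like, here is a small sketch that does not use Scrapy at all: it runs the limited XPath subset supported by xml.etree.ElementTree from the standard library over a made-up, well-formed page. SAMPLE_PAGE, the book and price class names, and the resulting books list are all assumptions for illustration. Scrapy's own selectors accept comparable expressions and also cope with real-world HTML that is not well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (ElementTree needs valid XML)
SAMPLE_PAGE = """
<html>
  <body>
    <div class="book"><h2>Title A</h2><span class="price">10</span></div>
    <div class="book"><h2>Title B</h2><span class="price">12</span></div>
    <div class="ad"><h2>Not a book</h2></div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE_PAGE.strip())

books = []
# ".//div[@class='book']" selects every <div> whose class attribute is "book"
for div in root.findall(".//div[@class='book']"):
    title = div.findtext("h2")                     # text of the child <h2>
    price = div.findtext("span[@class='price']")   # text of the price <span>
    books.append({"title": title, "price": int(price)})

print(books)
# prints: [{'title': 'Title A', 'price': 10}, {'title': 'Title B', 'price': 12}]
```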

Installing Scrapy

Using the pip command, we can install scrapy either in our virtual environment or in the global installation.

pip3 install scrapy

For a more detailed study of Scrapy, you can go to the link Scrapy.
