Python Tools for Web Scraping

In computer science, web scraping means extracting data from websites. The technique transforms unstructured data on the web into structured data.
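To make "unstructured to structured" concrete, here is a minimal sketch that uses only the standard library's html.parser module. The HTML fragment, the class names in it, and the BookParser class are all hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical page fragment standing in for a downloaded web page.
SAMPLE_HTML = """
<ul>
  <li class="book"><span class="title">Dune</span><span class="price">9.99</span></li>
  <li class="book"><span class="title">Emma</span><span class="price">4.50</span></li>
</ul>
"""


class BookParser(HTMLParser):
    """Collects one dict per <li class="book"> element."""

    def __init__(self):
        super().__init__()
        self.books = []      # the structured result
        self._field = None   # class of the <span> currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "book":
            self.books.append({})
        elif tag == "span" and attrs.get("class") in ("title", "price"):
            self._field = attrs["class"]

    def handle_data(self, data):
        # Only keep text that sits inside a recognised <span>.
        if self._field and self.books:
            self.books[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


parser = BookParser()
parser.feed(SAMPLE_HTML)
books = parser.books
print(books)
```

The markup goes in as one string and comes out as a list of dictionaries, which can then be written to a CSV file or a database.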

The most common web scraping tools in Python 3 are −

  • urllib (the Python 3 successor to urllib2)
  • Requests
  • BeautifulSoup
  • Lxml
  • Selenium
  • MechanicalSoup

urllib − This package ships with Python; in Python 3 it replaces the older urllib2 module. Its urllib.request module is used for opening URLs. The urlopen() function fetches a URL over several protocols (HTTP, FTP, etc.).

Example code

from urllib.request import urlopen
# pass the address of the page to fetch
my_html = urlopen("")
print(my_html.read())  # the response body, as bytes


Output

b'<!DOCTYPE html>\r\n
<!--[if IE 8]>
<html class="ie ie8">
\r\n<!--[if IE 9]>
<html class="ie ie9">
<![endif]-->\r\n<!--[if gt IE 9]><!-->
\r\n<html lang="en-US">
\r\n<head>\r\n<!-- Basic -->
\r\n<meta charset="utf-8">
\r\n<title>Parallax Scrolling, Java Cryptography, YAML, Python Data Science, Java i18n, GitLab, TestRail, VersionOne, DBUtils, Common CLI, Seaborn, Ansible, LOLCODE, Current Affairs 2018, Apache Commons Collections</title>
\r\n<meta name="Description" content="Parallax Scrolling, Java Cryptography, YAML, Python Data Science, Java i18n, GitLab, TestRail, VersionOne, DBUtils, Common CLI, Seaborn, Ansible, LOLCODE, Current Affairs 2018, Intellij Idea, Apache Commons Collections, Java 9, GSON, TestLink, Inter Process Communication (IPC), Logo, PySpark, Google Tag Manager, Free IFSC Code, SAP Workflow"/>
\r\n<meta name="Keywords" content="Python Data Science, Java i18n, GitLab, TestRail, VersionOne, DBUtils, Common CLI, Seaborn, Ansible, LOLCODE, Gson, TestLink, Inter Process Communication (IPC), Logo"/>\r\n
<meta http-equiv="X-UA-Compatible" content="IE=edge">\r\n<meta name="viewport" content="width=device-width,initial-scale=1.0,user-scalable=yes">\r\n<link href="" rel="stylesheet" type="text/css" />\r\n
<link rel="stylesheet" href="/questions/css/home.css?v=3" /> \r\n
<script src="/questions/js/jquery.min.js">
\r\n<script src="/questions/js/fontawesome.js">
<script src="">
<!-- Start of Body Content --> \r\n
<div class="mui-appbar-home">\r\n
<div class="mui-container">\r\n
<div class="tp-primary-header mui-top-home">\r\n
<a href="" target="_blank" title="TutorialsPoint - Home">
<i class="fa fa-home">
<div class="tp-primary-header mui-top-qa">\r\n
<a href="" target="_blank" title="Questions & Answers - The Best Technical Questions and Answers - TutorialsPoint"><i class="fa fa-location-arrow"></i>
<div class="tp-primary-header mui-top-tools">\r\n
<a href="" target="_blank" title="Tools - Online Development and Testing Tools">
<i class="fa fa-cogs"></i><span>Tools</span></a>\r\n
<div class="tp-primary-header mui-top-coding-ground">\r\n
<a href="" target="_blank" title="Coding Ground - Free Online IDE and Terminal">
<i class="fa fa-code">
Coding Ground </span>
</a> \r\n
<div class="tp-primary-header mui-top-current-affairs">\r\n
<a href="" target="_blank" title="Current Affairs - 2016, 2017 and 2018 | General Knowledge for Competitive Exams"><i class="fa fa-globe">
</i><span>Current Affairs</span>
<div class="tp-primary-header mui-top-upsc">\r\n
<a href="" target="_blank" title="UPSC IAS Exams Notes - TutorialsPoint"><i class="fa fa-user-tie"></i><span>UPSC Notes</span></a>\r\n
<div class="tp-primary-header mui-top-tutors">\r\n
<a href="" target="_blank" title="Top Online Tutors - Tutor Connect">
<i class="fa fa-user">
<span>Online Tutors</span>
<div class="tp-primary-header mui-top-examples">\r\n
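Because urlopen() is not limited to HTTP, the same call can be tried without any network access. The sketch below assumes a data: URL, which embeds the document in the URL itself; the sample URL and the variable names are illustrative only.

```python
from urllib.request import urlopen

# A data: URL carries its own content, so no server is contacted.
# The payload is the percent-encoded string "<h1>Hello</h1>".
sample_url = "data:text/html;charset=utf-8,%3Ch1%3EHello%3C%2Fh1%3E"

with urlopen(sample_url) as response:
    content_type = response.headers.get_content_type()
    raw_bytes = response.read()  # read() returns bytes, not str

# Decode the bytes before treating the page as text.
page_text = raw_bytes.decode("utf-8")
print(content_type)
print(page_text)
```

With an http:// or ftp:// address the calls stay the same; only the URL changes.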

Requests − This module is not preinstalled; install it with pip from the command prompt. Requests sends HTTP/1.1 requests.

pip install requests


import requests
# get URL
my_req = requests.get('')
# show the Content-Type header of the response
print(my_req.headers['Content-Type'])

Output

text/html; charset=UTF-8
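To see what Requests would put on the wire, a request can be built without being sent. This is a minimal sketch; the example.com endpoint, the query parameter, and the User-Agent value are assumptions made up for the illustration.

```python
import requests

# Hypothetical endpoint and parameters, used only for illustration.
request = requests.Request(
    "GET",
    "https://example.com/search",
    params={"q": "python"},
    headers={"User-Agent": "my-scraper/0.1"},
)

# prepare() builds the final HTTP request but does not send it.
prepared = request.prepare()

print(prepared.method)
print(prepared.url)
print(prepared.headers["User-Agent"])
```

Passing query parameters through params, rather than pasting them into the URL, leaves the encoding to the library.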

BeautifulSoup − This is a parsing library that can work with different parsers. Python's standard library provides html.parser, which BeautifulSoup can use without any additional installation. It builds a parse tree that is used to extract data from an HTML page.

To install this module, run the following command in the command prompt.

pip install beautifulsoup4


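from bs4 import BeautifulSoup
import requests
# fetch the page
my_req = requests.get("")
my_data = my_req.text
# name the parser explicitly so the result does not depend on what is installed
my_soup = BeautifulSoup(my_data, "html.parser")
# print the target of every link on the page
for my_link in my_soup.find_all('a'):
    print(my_link.get('href'))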
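The parse tree can also be explored without downloading anything. The following sketch parses a small, made-up HTML string and collects the text and target of each link; the markup and variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a downloaded page.
sample_html = """
<html><body>
  <a href="/python">Python</a>
  <a href="/java">Java</a>
  <a>No target</a>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# href=True skips <a> tags that have no href attribute.
links = [
    (a.get_text(strip=True), a["href"])
    for a in soup.find_all("a", href=True)
]
print(links)
```

Filtering with href=True avoids the None values that a plain find_all('a') loop would produce for anchors without a target.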


lxml − This is a high-performance, production-quality HTML and XML parsing library. It is a good choice when both quality and speed matter. It has many modules for extracting data from websites.

To install it, run the following command in the command prompt.

pip install lxml


from lxml import etree
# build a small HTML tree by hand
my_root_elem = etree.Element('html')
etree.SubElement(my_root_elem, 'head')
etree.SubElement(my_root_elem, 'title')
etree.SubElement(my_root_elem, 'body')
# serialize the tree to an indented string
print(etree.tostring(my_root_elem, pretty_print=True).decode("utf-8"))
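The example above builds a tree; for scraping, the more common direction is parsing existing markup and querying it. A minimal sketch using lxml's html module and XPath follows, with a made-up HTML string as input.

```python
from lxml import html

# Made-up markup standing in for a downloaded page.
sample_html = (
    "<html><body>"
    "<h1>Tutorials</h1>"
    "<a href='/python'>Python</a>"
    "<a href='/java'>Java</a>"
    "</body></html>"
)

tree = html.fromstring(sample_html)

# XPath expressions return lists of matching attribute values or text nodes.
hrefs = tree.xpath("//a/@href")
headings = tree.xpath("//h1/text()")
print(hrefs)
print(headings)
```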



Selenium − This is a browser automation tool, also known as WebDriver. Some pages only show their content after a delay, for example after a button is clicked or the page is scrolled. Selenium is needed in such cases because it drives a real browser.

To install Selenium, use this command.

pip install selenium


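from selenium import webdriver
from selenium.webdriver.chrome.service import Service
my_path_to_chromedriver = '/Users/Admin/Desktop/chromedriver'
# Selenium 4 takes the driver path through a Service object
my_browser = webdriver.Chrome(service=Service(executable_path=my_path_to_chromedriver))
my_url = ''
# load the page in the browser
my_browser.get(my_url)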



MechanicalSoup − This is another Python library for automating interaction with websites. It automatically stores and sends cookies, follows redirects, and can follow links and submit forms. It does not execute JavaScript.

To install it, use the following command.

pip install MechanicalSoup


import mechanicalsoup
my_browser = mechanicalsoup.StatefulBrowser()
# open the start page
my_value = my_browser.open("")
my_val = my_browser.get_url()
# follow the first link whose URL matches the regex "forms"
my_va = my_browser.follow_link("forms")
my_value1 = my_browser.get_url()
Published on 08-Nov-2018 15:15:05