Python Tools for Web Scraping

In computer science, web scraping means extracting data from websites. This technique transforms the unstructured data on the web into structured data.
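
As a minimal sketch of what "unstructured to structured" can mean, the snippet below turns a small piece of HTML into a list of records using only the standard library's html.parser module. The sample HTML and the LinkCollector class name are made up for illustration; the tools described in this article do the same job with far less code.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect every <a> tag as a {'text': ..., 'href': ...} record."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None  # record for the <a> tag being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current = {"text": "", "href": dict(attrs).get("href")}

    def handle_data(self, data):
        # Text only counts while we are inside an <a> tag
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.links.append(self._current)
            self._current = None


# Hypothetical sample markup standing in for a downloaded page
sample_html = (
    '<p><a href="/python/index.htm">Python</a> and '
    '<a href="/java8/index.htm">Java</a></p>'
)
collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```

Each link ends up as a dictionary with a text and an href key, which is easy to write to a CSV file or a database.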

The most common web scraping tools in Python 3 are −

Urllib
Requests
BeautifulSoup
Lxml
Selenium
MechanicalSoup

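Urllib − This module is part of Python's standard library, so nothing has to be installed. (It was called urllib2 in Python 2; in Python 3 its functionality lives in the urllib package.) It is used for fetching URLs. Its urlopen() function fetches URLs using different protocols (FTP, HTTP, etc.).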

Example code

from urllib.request import urlopen
my_html = urlopen("https://www.tutorialspoint.com/")
print(my_html.read())

Output

b'<!DOCTYPE html>\r\n
<!--[if IE 8]>
<html class="ie ie8">
<![endif]-->
\r\n<!--[if IE 9]>
<html class="ie ie9">
<![endif]-->\r\n<!--[if gt IE 9]><!-->
\r\n<html lang="en-US">
<!--<![endif]-->
\r\n<head>\r\n<!-- Basic -->
\r\n<meta charset="utf-8">
\r\n<title>Parallax Scrolling, Java Cryptography, YAML, Python Data Science, Java i18n, GitLab, TestRail, VersionOne, DBUtils, Common CLI, Seaborn, Ansible, LOLCODE, Current Affairs 2018, Apache Commons Collections</title>
\r\n<meta name="Description" content="Parallax Scrolling, Java Cryptography, YAML, Python Data Science, Java i18n, GitLab, TestRail, VersionOne, DBUtils, Common CLI, Seaborn, Ansible, LOLCODE, Current Affairs 2018, Intellij Idea, Apache Commons Collections, Java 9, GSON, TestLink, Inter Process Communication (IPC), Logo, PySpark, Google Tag Manager, Free IFSC Code, SAP Workflow"/>
\r\n<meta name="Keywords" content="Python Data Science, Java i18n, GitLab, TestRail, VersionOne, DBUtils, Common CLI, Seaborn, Ansible, LOLCODE, Gson, TestLink, Inter Process Communication (IPC), Logo"/>\r\n
<meta http-equiv="X-UA-Compatible" content="IE=edge">\r\n<meta name="viewport" content="width=device-width,initial-scale=1.0,user-scalable=yes">\r\n<link href="https://cdn.muicss.com/mui-0.9.39/extra/mui-rem.min.css" rel="stylesheet" type="text/css" />\r\n
<link rel="stylesheet" href="/questions/css/home.css?v=3" /> \r\n
<script src="/questions/js/jquery.min.js">
</script>
\r\n<script src="/questions/js/fontawesome.js">
</script>\r\n
<script src="https://cdn.muicss.com/mui-0.9.39/js/mui.min.js">
</script>\r\n
</head>\r\n
<body>\r\n
<!-- Start of Body Content --> \r\n
<div class="mui-appbar-home">\r\n
<div class="mui-container">\r\n
<div class="tp-primary-header mui-top-home">\r\n
<a href="https://www.tutorialspoint.com/index.htm" target="_blank" title="TutorialsPoint - Home">
<i class="fa fa-home">
</i><span>Home</span></a>\r\n
</div>\r\n
<div class="tp-primary-header mui-top-qa">\r\n
<a href="https://www.tutorialspoint.com/questions/index.php" target="_blank" title="Questions & Answers - The Best Technical Questions and Answers - TutorialsPoint"><i class="fa fa-location-arrow"></i>
<span>
Q/A</span></a>\r\n
</div>\r\n
<div class="tp-primary-header mui-top-tools">\r\n
<a href="https://www.tutorialspoint.com/online_dev_tools.htm" target="_blank" title="Tools - Online Development and Testing Tools">
<i class="fa fa-cogs"></i><span>Tools</span></a>\r\n
</div>\r\n
<div class="tp-primary-header mui-top-coding-ground">\r\n
<a href="https://www.tutorialspoint.com/codingground.htm" target="_blank" title="Coding Ground - Free Online IDE and Terminal">
<i class="fa fa-code">
</i>
<span>
Coding Ground </span>
</a> \r\n
</div>\r\n
<div class="tp-primary-header mui-top-current-affairs">\r\n
<a href="https://www.tutorialspoint.com/current_affairs/index.htm" target="_blank" title="Current Affairs - 2016, 2017 and 2018 | General Knowledge for Competitive Exams"><i class="fa fa-globe">
</i><span>Current Affairs</span>
</a>\r\n
</div>\r\n
<div class="tp-primary-header mui-top-upsc">\r\n
<a href="https://www.tutorialspoint.com/upsc_ias_exams.htm" target="_blank" title="UPSC IAS Exams Notes - TutorialsPoint"><i class="fa fa-user-tie"></i><span>UPSC Notes</span></a>\r\n
</div>\r\n
<div class="tp-primary-header mui-top-tutors">\r\n
<a href="https://www.tutorialspoint.com/tutor_connect/index.php" target="_blank" title="Top Online Tutors - Tutor Connect">
<i class="fa fa-user">
</i>
<span>Online Tutors</span>
</a>\r\n
</div>\r\n
<div class="tp-primary-header mui-top-examples">\r\n
….
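
The b'...' prefix in the output shows that read() returns raw bytes, not text. The following is a minimal sketch of one way to get a decoded string instead, assuming a helper named fetch_text and a made-up User-Agent value (neither comes from the article). To keep the example runnable without a network connection it fetches a data: URL, which urlopen() also understands; an http or https URL can be passed in the same way.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen


def fetch_text(url, timeout=10):
    """Fetch a URL and return its body decoded to str (hypothetical helper)."""
    # The User-Agent value is an arbitrary example string
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset announced by the response, falling back to UTF-8
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# A data: URL stands in for a real page so the example works offline
demo_url = "data:text/html;charset=utf-8," + quote("<h1>Hello</h1>")
print(fetch_text(demo_url))
```

The with block closes the connection as soon as the body has been read, and the timeout keeps the script from hanging on a slow server.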

Requests − This module sends HTTP/1.1 requests. It is not preinstalled, so we have to install it by running the following command at the command prompt.

pip install requests

Example

import requests
# get URL
my_req = requests.get('https://www.tutorialspoint.com/')
# inspect the response metadata
print(my_req.encoding)
print(my_req.status_code)
print(my_req.elapsed)
print(my_req.url)
print(my_req.history)
print(my_req.headers['Content-Type'])

Output

UTF-8
200
0:00:00.205727
https://www.tutorialspoint.com/
[]
text/html; charset=UTF-8
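
In a real scraper it usually helps to set a timeout, send query parameters as a dictionary, and fail loudly on error status codes. The sketch below shows one way to do that. The helper names build_request and send, the User-Agent string, and the q query parameter are assumptions for illustration, not something the site is known to support. Only the request-building part runs here, so no network access is needed.

```python
import requests


def build_request(url, params=None):
    """Prepare a GET request without sending it (hypothetical helper)."""
    request = requests.Request(
        "GET",
        url,
        params=params,
        headers={"User-Agent": "example-scraper/0.1"},  # arbitrary example value
    )
    return request.prepare()


def send(prepared, timeout=10):
    """Send a prepared request and raise an exception on 4xx/5xx responses."""
    with requests.Session() as session:
        response = session.send(prepared, timeout=timeout)
    response.raise_for_status()
    return response


# 'q' is a made-up query parameter used only to show URL encoding
prepared = build_request("https://www.tutorialspoint.com/", {"q": "python"})
print(prepared.method, prepared.url)
```

Calling send(prepared) would perform the actual request; raise_for_status() turns a 404 or 500 response into an exception instead of letting it pass silently.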

BeautifulSoup − This is a parsing library that can work with different parsers. Its default parser, html.parser, is part of Python's standard library. It builds a parse tree, which is used to extract data from an HTML page.

To install this module, we run the following command at the command prompt.

pip install beautifulsoup4

Example

from bs4 import BeautifulSoup
# importing requests
import requests
# get URL
my_req = requests.get("https://www.tutorialspoint.com/")
my_data = my_req.text
# name the parser explicitly; html.parser ships with Python
my_soup = BeautifulSoup(my_data, "html.parser")
# print the href attribute of every <a> tag
for my_link in my_soup.find_all('a'):
    print(my_link.get('href'))

Output

https://www.tutorialspoint.com/index.htm
https://www.tutorialspoint.com/questions/index.php
https://www.tutorialspoint.com/online_dev_tools.htm
https://www.tutorialspoint.com/codingground.htm
https://www.tutorialspoint.com/current_affairs/index.htm
https://www.tutorialspoint.com/upsc_ias_exams.htm
https://www.tutorialspoint.com/tutor_connect/index.php
https://www.tutorialspoint.com/programming_examples/
https://www.tutorialspoint.com/whiteboard.htm
https://www.tutorialspoint.com/netmeeting.php
https://www.tutorialspoint.com/articles/
https://www.tutorialspoint.com/index.htm
https://www.tutorialspoint.com/tutorialslibrary.htm
https://www.tutorialspoint.com/videotutorials/index.htm
https://store.tutorialspoint.com
https://www.tutorialspoint.com/html_online_training/index.asp
https://www.tutorialspoint.com/css_online_training/index.asp
https://www.tutorialspoint.com/3d_animation_online_training/index.asp
https://www.tutorialspoint.com/swift_4_online_training/index.asp
https://www.tutorialspoint.com/blockchain_online_training/index.asp
https://www.tutorialspoint.com/reactjs_online_training/index.asp
https://www.tutorialspoint.com/tutorialslibrary.htm
https://www.tutorialspoint.com/computer_fundamentals/index.htm
https://www.tutorialspoint.com/compiler_design/index.htm
https://www.tutorialspoint.com/operating_system/index.htm
https://www.tutorialspoint.com/data_structures_algorithms/index.htm
https://www.tutorialspoint.com/dbms/index.htm
https://www.tutorialspoint.com/data_communication_computer_network/index.htm
https://www.tutorialspoint.com/academic_tutorials.htm
https://www.tutorialspoint.com/html/index.htm
https://www.tutorialspoint.com/css/index.htm
https://www.tutorialspoint.com/javascript/index.htm
https://www.tutorialspoint.com/php/index.htm
https://www.tutorialspoint.com/angular4/index.htm
https://www.tutorialspoint.com/mysql/index.htm
https://www.tutorialspoint.com/web_development_tutorials.htm
https://www.tutorialspoint.com/cprogramming/index.htm
https://www.tutorialspoint.com/cplusplus/index.htm
https://www.tutorialspoint.com/java8/index.htm
https://www.tutorialspoint.com/python/index.htm
https://www.tutorialspoint.com/scala/index.htm
https://www.tutorialspoint.com/csharp/index.htm
https://www.tutorialspoint.com/computer_programming_tutorials.htm
https://www.tutorialspoint.com/java8/index.htm
https://www.tutorialspoint.com/jdbc/index.htm
https://www.tutorialspoint.com/servlets/index.htm
https://www.tutorialspoint.com/spring/index.htm
https://www.tutorialspoint.com/hibernate/index.htm
https://www.tutorialspoint.com/swing/index.htm
https://www.tutorialspoint.com/java_technology_tutorials.htm
https://www.tutorialspoint.com/android/index.htm
https://www.tutorialspoint.com/swift/index.htm
https://www.tutorialspoint.com/ios/index.htm
https://www.tutorialspoint.com/kotlin/index.htm
https://www.tutorialspoint.com/react_native/index.htm
https://www.tutorialspoint.com/xamarin/index.htm
https://www.tutorialspoint.com/mobile_development_tutorials.htm
https://www.tutorialspoint.com/mongodb/index.htm
https://www.tutorialspoint.com/plsql/index.htm
https://www.tutorialspoint.com/sql/index.htm
https://www.tutorialspoint.com/db2/index.htm
https://www.tutorialspoint.com/mysql/index.htm
https://www.tutorialspoint.com/memcached/index.htm
https://www.tutorialspoint.com/database_tutorials.htm
https://www.tutorialspoint.com/asp.net/index.htm
https://www.tutorialspoint.com/entity_framework/index.htm
https://www.tutorialspoint.com/vb.net/index.htm
https://www.tutorialspoint.com/ms_project/index.htm
https://www.tutorialspoint.com/excel/index.htm
https://www.tutorialspoint.com/word/index.htm
https://www.tutorialspoint.com/microsoft_technologies_tutorials.htm
https://www.tutorialspoint.com/big_data_analytics/index.htm
https://www.tutorialspoint.com/hadoop/index.htm
https://www.tutorialspoint.com/sas/index.htm
https://www.tutorialspoint.com/qlikview/index.htm
https://www.tutorialspoint.com/power_bi/index.htm
https://www.tutorialspoint.com/tableau/index.htm
https://www.tutorialspoint.com/big_data_tutorials.htm
https://www.tutorialspoint.com/tutorialslibrary.htm
https://www.tutorialspoint.com/codingground.htm
https://www.tutorialspoint.com/coding_platform_for_websites.htm
https://www.tutorialspoint.com/developers_best_practices/index.htm
https://www.tutorialspoint.com/effective_resume_writing.htm
https://www.tutorialspoint.com/computer_glossary.htm
https://www.tutorialspoint.com/computer_whoiswho.htm
https://www.tutorialspoint.com/questions_and_answers.htm
https://www.tutorialspoint.com/multi_language_tutorials.htm
https://itunes.apple.com/us/app/tutorials-point/id914891263?ls=1&mt=8
https://play.google.com/store/apps/details?id=com.tutorialspoint.onlineviewer
http://www.windowsphone.com/s?appid=91249671-7184-4ad6-8a5f-d11847946b09
/about/index.htm
/about/about_team.htm
/about/about_careers.htm
/about/about_privacy.htm
/about/about_terms_of_use.htm
https://www.tutorialspoint.com/articles/
https://www.tutorialspoint.com/online_dev_tools.htm
https://www.tutorialspoint.com/free_web_graphics.htm
https://www.tutorialspoint.com/online_file_conversion.htm
https://www.tutorialspoint.com/shared-tutorials.php
https://www.tutorialspoint.com/netmeeting.php
https://www.tutorialspoint.com/free_online_whiteboard.htm
http://www.tutorialspoint.com
https://www.facebook.com/tutorialspointindia
https://plus.google.com/u/0/+tutorialspoint
http://www.twitter.com/tutorialspoint
http://www.linkedin.com/company/tutorialspoint
https://www.youtube.com/channel/UCVLbzhxVTiTLiVKeGV7WEBg
https://www.tutorialspoint.com/index.htm
/about/about_privacy.htm#cookies
/about/faq.htm
/about/about_helping.htm
/about/contact_us.htm
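
The output above mixes absolute URLs with relative ones such as /about/index.htm. A minimal sketch of one way to normalise them is shown below, using urljoin from the standard library. The sample HTML is a made-up fragment standing in for the downloaded page, so the example runs without a network connection.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://www.tutorialspoint.com/"

# Hypothetical fragment with an absolute link, a relative link,
# and an <a> tag that has no href at all
sample_html = """
<a href="https://www.tutorialspoint.com/index.htm">Home</a>
<a href="/about/index.htm">About</a>
<a>No link here</a>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# href=True skips <a> tags without an href attribute
absolute_links = [
    urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)
]
print(absolute_links)
```

urljoin leaves absolute URLs unchanged and resolves relative ones against base_url, so every entry in the list can be fetched directly.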

Lxml − This is a high-performance, production-quality library for parsing HTML and XML. If we want high quality and maximum speed, this is the library to use. It has many modules with which we can extract data from a website.

To install it, we run the following command at the command prompt.

pip install lxml

Example

from lxml import etree
my_root_elem = etree.Element('html')
etree.SubElement(my_root_elem, 'head')
etree.SubElement(my_root_elem, 'title')
etree.SubElement(my_root_elem, 'body')
print(etree.tostring(my_root_elem, pretty_print = True).decode("utf-8"))

Output

<html>
  <head/>
  <title/>
  <body/>
</html>
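
The example above builds a document; for scraping, the more common task is reading one. The following is a minimal sketch of parsing HTML with the lxml.html module and extracting values with XPath. The sample markup is made up for illustration.

```python
from lxml import html

# Hypothetical page content standing in for a downloaded document
sample_html = (
    "<html><head><title>Demo page</title></head>"
    "<body><a href='/python/index.htm'>Python</a>"
    "<a href='/java8/index.htm'>Java</a></body></html>"
)

tree = html.fromstring(sample_html)

# XPath expressions return lists of matching strings or elements
title = tree.xpath("//title/text()")[0]
links = tree.xpath("//a/@href")

print(title)
print(links)
```

XPath lets one expression select exactly the nodes of interest, which is often shorter than walking the tree by hand.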

Selenium − This is a browser automation tool, also known as WebDriver. On many websites we have to wait for content to appear, for example after clicking a button or scrolling the page; in such cases Selenium is needed.

To install Selenium, we use this command.

pip install selenium

Example

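from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Path to the ChromeDriver executable; adjust for your machine
my_path_to_chromedriver = '/Users/Admin/Desktop/chromedriver'
# Selenium 4 takes the driver path through a Service object
my_browser = webdriver.Chrome(service=Service(executable_path=my_path_to_chromedriver))
my_url = 'https://www.tutorialspoint.com/'
my_browser.get(my_url)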

Output

The script prints nothing; a Chrome window opens and loads the tutorialspoint home page.

MechanicalSoup − This is another Python library for automating interaction with websites. It automatically stores and sends cookies, follows redirects, and can follow links and submit forms. It does not execute JavaScript.

To install it, we can use the following command.

pip install MechanicalSoup

Example

import mechanicalsoup
my_browser = mechanicalsoup.StatefulBrowser()
my_value = my_browser.open("https://www.tutorialspoint.com/")
print(my_value)
my_val = my_browser.get_url()
print(my_val)
my_va = my_browser.follow_link("forms")
print(my_va)
my_value1 = my_browser.get_url()
print(my_value1)

Samual Sam

Updated on: 26-Jun-2020
