How to Parse HTML pages to fetch HTML tables with Python?


Problem

You need to extract the HTML tables from a web page.

Introduction

The internet, and the World Wide Web (WWW), is the most prominent source of information today. There is so much information out there, it is just very hard to choose the content from so many options. Most of that information is retrievable through HTTP.

But we can also perform these operations programmatically to retrieve and process information automatically.

Python allows us to do this using its standard library an HTTP client, but the requests module helps in obtaining web pages information very easy.

In this post, we will see how to parse through the HTML pages to extract HTML tables embedded in the pages.

How to do it..

1.We will be using requests, pandas, beautifulsoup4 and tabulate packages. PLease install them on your system if they are missing. If you are unsure, please use pip freeze to validate.

import requests
import pandas as pd
from tabulate import tabulate

2.We will use the https://www.tutorialspoint.com/python/python_basic_operators.htm to prase through the page and print out all the HTML pages embedded inside them.

# set the site url
site_url = "https://www.tutorialspoint.com/python/python_basic_operators.htm"

3.We will make a request to the server and see the response.

# Make a request to the server
response = requests.get(site_url)

# Check the response
print(f"*** The response for {site_url} is {response.status_code}")

4.Well, response code 200 - represents the response back from server is success. So we will now check the request headers, reponse headers as well as first 100 text returned by the server.

# Check the request headers
print(f"*** Printing the request headers - \n {response.request.headers} ")

# Check the response headers
print(f"*** Printing the request headers - \n {response.headers} ")

# check the content of the results
print(f"*** Accessing the first 100/{len(response.text)} characters - \n\n {response.text[:100]} ")

Output

*** Printing the request headers -
{'User-Agent': 'python-requests/2.24.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
*** Printing the request headers -
{'Content-Encoding': 'gzip', 'Accept-Ranges': 'bytes', 'Age': '213246', 'Cache-Control': 'max-age=2592000', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Tue, 20 Oct 2020 09:45:18 GMT', 'Expires': 'Thu, 19 Nov 2020 09:45:18 GMT', 'Last-Modified': 'Sat, 17 Oct 2020 22:31:13 GMT', 'Server': 'ECS (meb/A77C)', 'Strict-Transport-Security': 'max-age=63072000; includeSubdomains', 'Vary': 'Accept-Encoding', 'X-Cache': 'HIT', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'SAMEORIGIN', 'X-XSS-Protection': '1; mode=block', 'Content-Length': '8863'}
*** Accessing the first 100/37624 characters -

<!DOCTYPE html>
<html lang="en-US">
<head>
<title>Python - Basic Operators - Tutorialspoint</title>

5.We will now use BeautifulSoup to parse through the HTML.

# Parse the HTML pages

from bs4 import BeautifulSoup
tutorialpoints_page = BeautifulSoup(response.text, 'html.parser')
print(f"*** The title of the page is - {tutorialpoints_page.title}")

# You can extract the page title as string as well
print(f"*** The title of the page is - {tutorialpoints_page.title.string}")

6.Well, most of the tables will have heading defined either in h2, h3, h4, h5 or h6 tags. We first identify these tags then we pickup the html table next to the tag identified. For this logic, we will use find, sibling and find_next_siblings as defined below.

# Find all the h3 elements
print(f"{tutorialpoints_page.find_all('h2')}")
tags = tutorialpoints_page.find(lambda elm: elm.name == "h2" or elm.name == "h3" or elm.name == "h4" or elm.name == "h5" or elm.name == "h6")
for sibling in tags.find_next_siblings():
if sibling.name == "table":
my_table = sibling
df = pd.read_html(str(my_table))
print(tabulate(df[0], headers='keys', tablefmt='psql'))

Full code

7.Putting it all together now.

# STEP1 : Download the page required
import requests
import pandas as pd


# set the site url
site_url = "https://www.tutorialspoint.com/python/python_basic_operators.htm"

# Make a request to the server
response = requests.get(site_url)

# Check the response
print(f"*** The response for {site_url} is {response.status_code}")

# Check the request headers
print(f"*** Printing the request headers - \n {response.request.headers} ")

# Check the response headers
print(f"*** Printing the request headers - \n {response.headers} ")

# check the content of the results
print(f"*** Accessing the first 100/{len(response.text)} characters - \n\n {response.text[:100]} ")

# Parse the HTML pages

from bs4 import BeautifulSoup
tutorialpoints_page = BeautifulSoup(response.text, 'html.parser')
print(f"*** The title of the page is - {tutorialpoints_page.title}")

# You can extract the page title as string as well
print(f"*** The title of the page is - {tutorialpoints_page.title.string}")

# Find all the h3 elements
# print(f"{tutorialpoints_page.find_all('h2')}")
tags = tutorialpoints_page.find(lambda elm: elm.name == "h2" or elm.name == "h3" or elm.name == "h4" or elm.name == "h5" or elm.name == "h6")
for sibling in tags.find_next_siblings():
if sibling.name == "table":
my_table = sibling
df = pd.read_html(str(my_table))
print(df)

Output

*** The response for https://www.tutorialspoint.com/python/python_basic_operators.htm is 200
*** Printing the request headers -
{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
*** Printing the request headers -
{'Content-Encoding': 'gzip', 'Accept-Ranges': 'bytes', 'Age': '558841', 'Cache-Control': 'max-age=2592000', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Sat, 24 Oct 2020 09:45:13 GMT', 'Expires': 'Mon, 23 Nov 2020 09:45:13 GMT', 'Last-Modified': 'Sat, 17 Oct 2020 22:31:13 GMT', 'Server': 'ECS (meb/A77C)', 'Strict-Transport-Security': 'max-age=63072000; includeSubdomains', 'Vary': 'Accept-Encoding', 'X-Cache': 'HIT', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'SAMEORIGIN', 'X-XSS-Protection': '1; mode=block', 'Content-Length': '8863'}
*** Accessing the first 100/37624 characters -

<!DOCTYPE html>
<html lang="en-US">
<head>
<title>Python - Basic Operators - Tutorialspoint</title>
*** The title of the page is - <title>Python - Basic Operators - Tutorialspoint</title>
*** The title of the page is - Python - Basic Operators - Tutorialspoint
[<h2>Types of Operator</h2>, <h2>Python Arithmetic Operators</h2>, <h2>Python Comparison Operators</h2>, <h2>Python Assignment Operators</h2>, <h2>Python Bitwise Operators</h2>, <h2>Python Logical Operators</h2>, <h2>Python Membership Operators</h2>, <h2>Python Identity Operators</h2>, <h2>Python Operators Precedence</h2>]
[ Operator Description \
0 + Addition Adds values on either side of the operator.
1 - Subtraction Subtracts right hand operand from left hand op...
2 * Multiplication Multiplies values on either side of the operator
3 / Division Divides left hand operand by right hand operand
4 % Modulus Divides left hand operand by right hand operan...
5 ** Exponent Performs exponential (power) calculation on op...
6 // Floor Division - The division of operands wher...

Example

0 a + b = 30
1 a – b = -10
2 a * b = 200
3 b / a = 2
4 b % a = 0
5 a**b =10 to the power 20
6 9//2 = 4 and 9.0//2.0 = 4.0, -11//3 = -4, -11.... ]
[ Operator Description \
0 == If the values of two operands are equal, then ...
1 != If values of two operands are not equal, then ...
2 <> If values of two operands are not equal, then ...
3 > If the value of left operand is greater than t...
4 < If the value of left operand is less than the ...
5 >= If the value of left operand is greater than o...
6 <= If the value of left operand is less than or e...

Example

0 (a == b) is not true.
1 (a != b) is true.
2 (a <> b) is true. This is similar to != operator.
3 (a > b) is not true.
4 (a < b) is true.
5 (a >= b) is not true.
6 (a <= b) is true. ]
[ Operator Description \
0 = Assigns values from right side operands to lef...
1 += Add AND It adds right operand to the left operand and ...
2 -= Subtract AND It subtracts right operand from the left opera...
3 *= Multiply AND It multiplies right operand with the left oper...
4 /= Divide AND It divides left operand with the right operand...
5 %= Modulus AND It takes modulus using two operands and assign...
6 **= Exponent AND Performs exponential (power) calculation on op...
7 //= Floor Division It performs floor division on operators and as...

Example

0 c = a + b assigns value of a + b into c
1 c += a is equivalent to c = c + a
2 c -= a is equivalent to c = c - a
3 c *= a is equivalent to c = c * a
4 c /= a is equivalent to c = c / a
5 c %= a is equivalent to c = c % a
6 c **= a is equivalent to c = c ** a
7 c //= a is equivalent to c = c // a ]
[ Operator \
0 & Binary AND
1 | Binary OR
2 ^ Binary XOR
3 ~ Binary Ones Complement
4 << Binary Left Shift
5 >> Binary Right Shift

Description \
0 Operator copies a bit to the result if it exis...
1 It copies a bit if it exists in either operand.
2 It copies the bit if it is set in one operand ...
3 It is unary and has the effect of 'flipping' b...
4 The left operands value is moved left by the n...
5 The left operands value is moved right by the ...

Example

0 (a & b) (means 0000 1100)
1 (a | b) = 61 (means 0011 1101)
2 (a ^ b) = 49 (means 0011 0001)
3 (~a ) = -61 (means 1100 0011 in 2's complement...
4 a << 2 = 240 (means 1111 0000)
5 a >> 2 = 15 (means 0000 1111) ]
[ Operator Description \
0 and Logical AND If both the operands are true then condition b...
1 or Logical OR If any of the two operands are non-zero then c...
2 not Logical NOT Used to reverse the logical state of its operand.

Example
0 (a and b) is true.
1 (a or b) is true.
2 Not(a and b) is false. ]
[ Operator Description \
0 in Evaluates to true if it finds a variable in th...
1 not in Evaluates to true if it does not finds a varia...

Example

0 x in y, here in results in a 1 if x is a membe...
1 x not in y, here not in results in a 1 if x is... ]
[ Operator Description \
0 is Evaluates to true if the variables on either s...
1 is not Evaluates to false if the variables on either ...

Example

0 x is y, here is results in 1 if id(x) equals i...
1 x is not y, here is not results in 1 if id(x) ... ]
[ Sr.No. Operator & Description
0 1 ** Exponentiation (raise to the power)
1 2 ~ + - Complement, unary plus and minus (method...
2 3 * / % // Multiply, divide, modulo and floor di...
3 4 + - Addition and subtraction
4 5 >> << Right and left bitwise shift
5 6 & Bitwise 'AND'
6 7 ^ | Bitwise exclusive `OR' and regular `OR'
7 8 <= < > >= Comparison operators
8 9 <> == != Equality operators
9 10 = %= /= //= -= += *= **= Assignment operators
10 11 is is not Identity operators
11 12 in not in]

Updated on: 09-Nov-2020

596 Views

Kickstart Your Career

Get certified by completing the course

Get Started
Advertisements