How to Parse HTML pages to fetch HTML tables with Python?

Extracting HTML tables from web pages is a common task in web scraping and data analysis. The third-party libraries requests, BeautifulSoup, and pandas make this straightforward: requests downloads the page, BeautifulSoup locates the table elements, and pandas converts them into DataFrames.

Required Libraries

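First, install the necessary packages if they're not already available. pd.read_html() needs an HTML parser backend; by default it tries lxml first, so lxml is included here:

pip install requests beautifulsoup4 pandas lxml tabulate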

Basic Setup

Import the required libraries and set up the target URL:

import requests
import pandas as pd
from bs4 import BeautifulSoup
from tabulate import tabulate

# Set the target URL
site_url = "https://www.tutorialspoint.com/python/python_basic_operators.htm"

Making HTTP Request

Send a GET request to fetch the webpage content:

# Make a request to the server
response = requests.get(site_url)

# Check the response status
print(f"Response status code: {response.status_code}")
print(f"Content length: {len(response.text)} characters")
print(f"First 100 characters:\n{response.text[:100]}")

Output

Response status code: 200
Content length: 37624 characters
First 100 characters:
<!DOCTYPE html>
<html lang="en-US">
<head>
<title>Python - Basic Operators - Tutorialspoint</title>
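
The request above assumes the server answers promptly and with a success status. A minimal sketch of a more defensive fetch follows. The helper name fetch_html, the optional session parameter, and the 10-second timeout are assumptions made for illustration, not part of the original tutorial.

```python
import requests


def fetch_html(url, timeout=10, session=None):
    """Return the page body as text, or raise if the request fails.

    fetch_html is a hypothetical helper; the 10-second timeout is an
    arbitrary illustrative value.
    """
    # Use the caller's session if one is given, otherwise the module-level API
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text
```

Calling fetch_html(site_url) should return the same text as response.text above, but a 4xx or 5xx status raises requests.HTTPError rather than passing an error page on to the parser.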

Parsing HTML with BeautifulSoup

Parse the HTML content and extract basic information:

# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Extract page title
title = soup.title.string
print(f"Page title: {title}")

# Find all heading tags that might precede tables
headings = soup.find_all(['h2', 'h3', 'h4', 'h5', 'h6'])
print(f"Found {len(headings)} heading tags")

Output

Page title: Python - Basic Operators - Tutorialspoint
Found 9 heading tags
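
Counting headings is only a first step. Since headings often precede the tables they describe, one possible use is to label each table with the nearest heading before it. The sketch below works on a small inline page rather than the live site; SAMPLE_HTML and label_tables are hypothetical names introduced for this example.

```python
from bs4 import BeautifulSoup

# Small inline page standing in for a downloaded document (illustrative only)
SAMPLE_HTML = """
<h2>Arithmetic Operators</h2>
<p>Some text.</p>
<table><tr><th>Operator</th></tr><tr><td>+</td></tr></table>
<h2>Comparison Operators</h2>
<table><tr><th>Operator</th></tr><tr><td>==</td></tr></table>
"""

HEADING_TAGS = ["h2", "h3", "h4", "h5", "h6"]


def label_tables(html):
    """Pair each <table> with the text of the closest heading before it."""
    soup = BeautifulSoup(html, "html.parser")
    labelled = []
    for table in soup.find_all("table"):
        # find_previous walks backwards through the document from this table
        heading = table.find_previous(HEADING_TAGS)
        label = heading.get_text(strip=True) if heading else None
        labelled.append((label, table))
    return labelled


labels = [label for label, _ in label_tables(SAMPLE_HTML)]
print(labels)
```

A table with no heading before it gets the label None, so the caller can fall back to a positional name such as its index.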

Extracting Tables

Find and extract HTML tables from the webpage:

from io import StringIO

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Fetch and parse the webpage
site_url = "https://www.tutorialspoint.com/python/python_basic_operators.htm"
response = requests.get(site_url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find all tables in the page
tables = soup.find_all('table')
print(f"Found {len(tables)} tables on the page")

# Extract the first table as an example
if tables:
    first_table = tables[0]

    # Convert HTML table to pandas DataFrame.
    # Wrap the markup in StringIO: passing a literal HTML string
    # to read_html is deprecated since pandas 2.1.
    df = pd.read_html(StringIO(str(first_table)))[0]
    
    # Display the table
    print("\nFirst table content:")
    print(df.head())

Output

Found 10 tables on the page

First table content:
    Operator                                        Description
0          +              Adds values on either side of the operator.
1          -  Subtracts right hand operand from left hand op...
2          *    Multiplies values on either side of the operator
3          /   Divides left hand operand by right hand operand
4          %  Divides left hand operand by right hand operan...
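
When only some tables on a page are of interest, pd.read_html() can filter them itself through its match parameter, which keeps tables whose text matches a string or regular expression. The sketch below uses inline markup in place of a downloaded page; SAMPLE_HTML and its contents are made up for illustration, and an installed parser backend (lxml, or html5lib with BeautifulSoup) is assumed.

```python
from io import StringIO

import pandas as pd

# Inline markup standing in for response.text (illustrative only)
SAMPLE_HTML = """
<table>
  <thead><tr><th>Operator</th><th>Name</th></tr></thead>
  <tbody>
    <tr><td>+</td><td>Addition</td></tr>
    <tr><td>%</td><td>Modulus</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>Keyword</th><th>Meaning</th></tr></thead>
  <tbody><tr><td>and</td><td>Logical AND</td></tr></tbody>
</table>
"""

# match keeps only tables whose text matches the given string or regex
operator_tables = pd.read_html(StringIO(SAMPLE_HTML), match="Operator")
df = operator_tables[0]
print(df)
```

Here only the first table contains the word "Operator", so the second one is skipped without any BeautifulSoup code.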

Complete Table Extraction Function

The following function extracts every table from a webpage and returns each one together with its position on the page and its shape:

from io import StringIO

def extract_tables_from_url(url):
    """Extract all HTML tables from a given URL"""

    # Fetch the webpage (the timeout keeps the call from hanging indefinitely)
    response = requests.get(url, timeout=10)
    if response.status_code != 200:
        print(f"Failed to fetch page. Status code: {response.status_code}")
        return []

    # Parse HTML content
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find all tables
    tables = soup.find_all('table')
    extracted_tables = []

    for i, table in enumerate(tables):
        try:
            # Convert to DataFrame. StringIO avoids the deprecation
            # (pandas 2.1+) of passing literal HTML strings to read_html.
            df = pd.read_html(StringIO(str(table)))[0]
            extracted_tables.append({
                'table_index': i + 1,
                'dataframe': df,
                'shape': df.shape
            })
        except Exception as e:
            print(f"Error processing table {i + 1}: {e}")
    
    return extracted_tables

# Example usage
url = "https://www.tutorialspoint.com/python/python_basic_operators.htm"
all_tables = extract_tables_from_url(url)

print(f"Successfully extracted {len(all_tables)} tables")
for table_info in all_tables[:2]:  # Show first 2 tables
    print(f"\nTable {table_info['table_index']} - Shape: {table_info['shape']}")
    print(table_info['dataframe'].head(3))

Output

Successfully extracted 10 tables

Table 1 - Shape: (7, 2)
    Operator                                        Description
0          +              Adds values on either side of the operator.
1          -  Subtracts right hand operand from left hand op...
2          *    Multiplies values on either side of the operator

Table 2 - Shape: (7, 1)
                                                    0
0                                          a + b = 30
1                                         a - b = -10
2                                        a * b = 200

Comparison of Methods

Method                 | Best For                | Pros                | Cons
pd.read_html()         | Simple table extraction | One-line solution   | Less control over parsing
BeautifulSoup + pandas | Complex parsing needs   | Full HTML control   | More code required
Manual BeautifulSoup   | Custom table structures | Maximum flexibility | Most complex to implement
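
The examples so far cover the first two rows of the comparison. For the third, a minimal sketch of manual extraction with BeautifulSoup is shown below. The helper table_to_dataframe and the sample markup are hypothetical, and the sketch assumes a simple table whose first row is the header, with no rowspan, colspan, or nested tables.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Inline markup standing in for a table found on a page (illustrative only)
SAMPLE_HTML = """
<table>
  <tr><th>Operator</th><th>Description</th></tr>
  <tr><td>+</td><td>Adds values</td></tr>
  <tr><td>-</td><td>Subtracts values</td></tr>
</table>
"""


def table_to_dataframe(table):
    """Build a DataFrame from a bs4 <table> tag, treating the first row as the header."""
    rows = []
    for tr in table.find_all("tr"):
        # Collect the stripped text of every header or data cell in this row
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    if not rows:
        return pd.DataFrame()
    header, *body = rows
    return pd.DataFrame(body, columns=header)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
df = table_to_dataframe(soup.find("table"))
print(df)
```

Because every cell is handled explicitly, this is the place to add custom rules, such as reading link targets from cells or skipping decorative rows, at the cost of writing the row handling yourself.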

Conclusion

Use requests and BeautifulSoup to fetch and parse HTML pages, then pd.read_html() to convert tables to DataFrames. This combination provides both flexibility and simplicity for web scraping tasks.

Updated on: 2026-03-25T11:51:21+05:30
