How to Parse HTML pages to fetch HTML tables with Python?
Extracting HTML tables from web pages is a common task in web scraping and data analysis. Python provides powerful libraries like requests, BeautifulSoup, and pandas to make this process straightforward.
Required Libraries
First, install the necessary packages if they're not already available. pd.read_html() also needs an HTML parser backend such as lxml, so it is included here:
pip install requests beautifulsoup4 pandas tabulate lxml
Basic Setup
Import the required libraries and set up the target URL:
import requests
import pandas as pd
from bs4 import BeautifulSoup
from tabulate import tabulate
# Set the target URL
site_url = "https://www.tutorialspoint.com/python/python_basic_operators.htm"
Making HTTP Request
Send a GET request to fetch the webpage content:
# Make a request to the server
response = requests.get(site_url)
# Check the response status
print(f"Response status code: {response.status_code}")
print(f"Content length: {len(response.text)} characters")
print(f"First 100 characters:\n{response.text[:100]}")
Response status code: 200
Content length: 37624 characters
First 100 characters:
<!DOCTYPE html>
<html lang="en-US">
<head>
<title>Python - Basic Operators - Tutorialspoint</title>
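The request above assumes the server responds quickly and successfully. A slightly more defensive fetch could look like the sketch below. The helper name `fetch_html`, the 10-second timeout, and the optional `session` parameter are illustrative assumptions, not part of the original walkthrough.

```python
import requests


def fetch_html(url, session=None, timeout=10.0):
    """Return the page body as text, raising on network or HTTP errors.

    `session` is a hypothetical hook: pass a requests.Session to reuse
    connections, or leave it as None to use the module-level requests.get.
    """
    http = session or requests
    # A timeout keeps the script from hanging on an unresponsive server
    response = http.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text


# Example usage (performs a real network request):
# html = fetch_html("https://www.tutorialspoint.com/python/python_basic_operators.htm")
# print(len(html))
```

Raising on a bad status, rather than printing it, lets the calling code decide whether to retry, skip the page, or stop.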
Parsing HTML with BeautifulSoup
Parse the HTML content and extract basic information:
# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Extract page title
title = soup.title.string
print(f"Page title: {title}")
# Find all heading tags that might precede tables
headings = soup.find_all(['h2', 'h3', 'h4', 'h5', 'h6'])
print(f"Found {len(headings)} heading tags")
Page title: Python - Basic Operators - Tutorialspoint
Found 9 heading tags
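Headings become useful once they are tied to the tables they introduce. One possible way to do that is to ask BeautifulSoup for the nearest heading that appears before each table, as in the sketch below. The sample markup and the function name `label_tables` are made up for illustration; real pages may place captions differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a real page
SAMPLE_HTML = """
<h2>Arithmetic Operators</h2>
<table><tr><th>Operator</th></tr><tr><td>+</td></tr></table>
<h2>Comparison Operators</h2>
<p>Some text.</p>
<table><tr><th>Operator</th></tr><tr><td>==</td></tr></table>
"""


def label_tables(html):
    """Return the text of the closest preceding heading for each table."""
    soup = BeautifulSoup(html, "html.parser")
    labels = []
    for table in soup.find_all("table"):
        # find_previous walks backwards through the document from the table
        heading = table.find_previous(["h2", "h3", "h4", "h5", "h6"])
        labels.append(heading.get_text(strip=True) if heading else None)
    return labels


labels = label_tables(SAMPLE_HTML)
print(labels)
```

A table with no heading before it gets `None`, so the list always lines up one-to-one with `soup.find_all("table")`.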
Extracting Tables
Find and extract HTML tables from the webpage:
from io import StringIO
import requests
import pandas as pd
from bs4 import BeautifulSoup
# Fetch and parse the webpage
site_url = "https://www.tutorialspoint.com/python/python_basic_operators.htm"
response = requests.get(site_url)
soup = BeautifulSoup(response.text, 'html.parser')
# Find all tables in the page
tables = soup.find_all('table')
print(f"Found {len(tables)} tables on the page")
# Extract the first table as an example
if tables:
    first_table = tables[0]
    # Convert HTML table to pandas DataFrame; wrapping the markup in StringIO
    # avoids the literal-string deprecation warning raised by pandas 2.1+
    df = pd.read_html(StringIO(str(first_table)))[0]
    # Display the table
    print("\nFirst table content:")
    print(df.head())
Found 10 tables on the page
First table content:
Operator Description
0 + Adds values on either side of the operator.
1 - Subtracts right hand operand from left hand op...
2 * Multiplies values on either side of the operator
3 / Divides left hand operand by right hand operand
4 % Divides left hand operand by right hand operan...
Complete Table Extraction Function
Here's a function that extracts every table from a webpage and skips any table that fails to convert:
from io import StringIO
def extract_tables_from_url(url):
    """Extract all HTML tables from a given URL"""
    # Fetch the webpage
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to fetch page. Status code: {response.status_code}")
        return []
    # Parse HTML content
    soup = BeautifulSoup(response.text, 'html.parser')
    # Find all tables
    tables = soup.find_all('table')
    extracted_tables = []
    for i, table in enumerate(tables):
        try:
            # Convert to DataFrame (StringIO avoids the pandas 2.1+ deprecation warning)
            df = pd.read_html(StringIO(str(table)))[0]
            extracted_tables.append({
                'table_index': i + 1,
                'dataframe': df,
                'shape': df.shape
            })
        except Exception as e:
            print(f"Error processing table {i + 1}: {e}")
    return extracted_tables
# Example usage
url = "https://www.tutorialspoint.com/python/python_basic_operators.htm"
all_tables = extract_tables_from_url(url)
print(f"Successfully extracted {len(all_tables)} tables")
for table_info in all_tables[:2]:  # Show first 2 tables
    print(f"\nTable {table_info['table_index']} - Shape: {table_info['shape']}")
    print(table_info['dataframe'].head(3))
Successfully extracted 10 tables
Table 1 - Shape: (7, 2)
Operator Description
0 + Adds values on either side of the operator.
1 - Subtracts right hand operand from left hand op...
2 * Multiplies values on either side of the operator
Table 2 - Shape: (7, 1)
0
0 a + b = 30
1 a - b = -10
2 a * b = 200
Comparison of Methods
| Method | Best For | Pros | Cons |
|---|---|---|---|
| pd.read_html() | Simple table extraction | One-line solution | Less control over parsing |
| BeautifulSoup + pandas | Complex parsing needs | Full HTML control | More code required |
| Manual BeautifulSoup | Custom table structures | Maximum flexibility | Most complex to implement |
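To make the "Manual BeautifulSoup" row concrete, the sketch below walks a table row by row and collects the cell text into plain Python lists, without pandas. The sample markup and the function name `table_to_rows` are illustrative assumptions. The function also does not handle nested tables or `rowspan`/`colspan`, which is where manual parsing usually needs extra work.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a table found on a real page
SAMPLE_HTML = """
<table>
  <tr><th>Operator</th><th>Description</th></tr>
  <tr><td>+</td><td>Adds <b>values</b></td></tr>
  <tr><td>-</td><td>Subtracts values</td></tr>
</table>
"""


def table_to_rows(table):
    """Convert a <table> Tag into a list of rows, each a list of cell strings."""
    rows = []
    for tr in table.find_all("tr"):
        # Header and data cells are treated alike here
        cells = tr.find_all(["th", "td"])
        # Join text from nested tags with a space and trim surrounding whitespace
        rows.append([cell.get_text(" ", strip=True) for cell in cells])
    return rows


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = table_to_rows(soup.find("table"))
for row in rows:
    print(row)
```

The first inner list holds the header cells, so `pd.DataFrame(rows[1:], columns=rows[0])` would turn the result into a DataFrame if one is needed later.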
Conclusion
Use requests and BeautifulSoup to fetch and parse HTML pages, then pd.read_html() to convert tables to DataFrames. This combination provides both flexibility and simplicity for web scraping tasks.
