HyperText Markup Language Support in Python
Python can process HTML through the HTMLParser class in the standard library's html.parser module. As it reads the input, the parser reports each tag it encounters, the position of that tag (line number and offset), and the text found between tags.
To use it, you subclass HTMLParser and override the handler methods for only the events you care about: start tags, end tags, and the text data between tags.
Basic HTML File Structure
Let's start with a simple HTML file that we'll parse:
<html>
<head>
<title>Welcome to Tutorials Point!</title>
</head>
<body>
<h1>Learn anything!</h1>
</body>
</html>
Creating a Custom HTML Parser
Below is a program that creates a custom parser to process HTML content:
from html.parser import HTMLParser

class CustomParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Line and Offset ==", self.getpos())
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Line and Offset ==", self.getpos())
        print("Encountered an end tag:", tag)

    def handle_data(self, data):
        print("Line and Offset ==", self.getpos())
        print("Encountered some data:", repr(data))

# Create parser instance and process HTML
parser = CustomParser()
html_content = """<html>
<head>
<title>Welcome to Tutorials Point!</title>
</head>
<body>
<h1>Learn anything!</h1>
</body>
</html>"""
parser.feed(html_content)
The output of the above code is:
Line and Offset == (1, 0)
Encountered a start tag: html
Line and Offset == (1, 6)
Encountered some data: '\n'
Line and Offset == (2, 0)
Encountered a start tag: head
Line and Offset == (2, 6)
Encountered some data: '\n'
Line and Offset == (3, 0)
Encountered a start tag: title
Line and Offset == (3, 7)
Encountered some data: 'Welcome to Tutorials Point!'
Line and Offset == (3, 34)
Encountered an end tag: title
Line and Offset == (3, 42)
Encountered some data: '\n'
Line and Offset == (4, 0)
Encountered an end tag: head
Line and Offset == (4, 7)
Encountered some data: '\n'
Line and Offset == (5, 0)
Encountered a start tag: body
Line and Offset == (5, 6)
Encountered some data: '\n'
Line and Offset == (6, 0)
Encountered a start tag: h1
Line and Offset == (6, 4)
Encountered some data: 'Learn anything!'
Line and Offset == (6, 19)
Encountered an end tag: h1
Line and Offset == (6, 24)
Encountered some data: '\n'
Line and Offset == (7, 0)
Encountered an end tag: body
Line and Offset == (7, 7)
Encountered some data: '\n'
Line and Offset == (8, 0)
Encountered an end tag: html
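The handlers above accept an attrs argument but never use it. In html.parser, attrs is a list of (name, value) tuples holding the tag's attributes. As a minimal sketch, the hypothetical LinkCollector class and sample_html string below (neither comes from the example above) gather the href value of every <a> tag:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper for illustration: collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples, e.g. [("href", "/python")]
        if tag == "a":
            for name, value in attrs:
                # value is None for attributes written without a value
                if name == "href" and value is not None:
                    self.links.append(value)


# Made-up sample input for this sketch
sample_html = (
    '<p><a href="/python">Python</a> and '
    '<a class="ext" href="https://example.com">Example</a></p>'
)

link_collector = LinkCollector()
link_collector.feed(sample_html)
print(link_collector.links)  # ['/python', 'https://example.com']
```

Because attrs is a plain list, dict(attrs) is a convenient alternative when each attribute appears only once per tag.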
Extracting Specific Content
You can also create a parser that extracts specific content like title and headings:
from html.parser import HTMLParser

class ContentExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.current_tag = None
        self.title = ""
        self.headings = []

    def handle_starttag(self, tag, attrs):
        self.current_tag = tag

    def handle_endtag(self, tag):
        self.current_tag = None

    def handle_data(self, data):
        if self.current_tag == "title":
            self.title = data.strip()
        elif self.current_tag in ["h1", "h2", "h3"]:
            self.headings.append(data.strip())

# Parse HTML and extract content
extractor = ContentExtractor()
html_content = """<html>
<head>
<title>Python HTML Parsing</title>
</head>
<body>
<h1>Main Heading</h1>
<h2>Sub Heading</h2>
</body>
</html>"""
extractor.feed(html_content)
print("Title:", extractor.title)
print("Headings:", extractor.headings)
The output of the above code is:
Title: Python HTML Parsing
Headings: ['Main Heading', 'Sub Heading']
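One limitation of tracking only the most recent tag is that a heading containing inline markup, such as <h1>Main <em>Heading</em></h1>, would be split or partly lost. A possible workaround, sketched below under the assumption that headings are not nested inside each other, is to buffer text until the matching closing tag arrives. The HeadingCollector class and its input string are hypothetical and not part of the example above:

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Hypothetical variant that tolerates inline tags inside headings."""

    HEADING_TAGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.open_heading = None  # heading tag currently being read, if any
        self.buffer = []          # text pieces collected for that heading
        self.headings = []

    def handle_starttag(self, tag, attrs):
        # Start buffering only when no heading is already open
        if tag in self.HEADING_TAGS and self.open_heading is None:
            self.open_heading = tag
            self.buffer = []

    def handle_endtag(self, tag):
        # Only the matching closing tag ends the heading; </em>, </b>, etc. are ignored
        if self.open_heading is not None and tag == self.open_heading:
            self.headings.append("".join(self.buffer).strip())
            self.open_heading = None

    def handle_data(self, data):
        if self.open_heading is not None:
            self.buffer.append(data)


heading_collector = HeadingCollector()
heading_collector.feed(
    "<h1>Main <em>Heading</em></h1><p>Body text</p><h2>Sub Heading</h2>"
)
print(heading_collector.headings)  # ['Main Heading', 'Sub Heading']
```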
Key Methods
| Method | Purpose | Parameters |
|---|---|---|
| handle_starttag() | Process opening tags | tag, attributes |
| handle_endtag() | Process closing tags | tag |
| handle_data() | Process text content | data |
| getpos() | Get line and offset | None |
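These methods can also be combined. The sketch below is one way to print an indented outline of a document by tracking nesting depth across handle_starttag() and handle_endtag() and labelling each tag with getpos(). The OutlinePrinter class is a hypothetical name, and the approach assumes well-nested input without void elements such as <br> or <img>, which have no closing tag and would throw the depth off:

```python
from html.parser import HTMLParser


class OutlinePrinter(HTMLParser):
    """Hypothetical helper: records an indented outline of tags and text."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        line, offset = self.getpos()  # position where this tag starts
        indent = "  " * self.depth
        self.lines.append(f"{indent}{tag} (line {line}, offset {offset})")
        self.depth += 1

    def handle_endtag(self, tag):
        # Assumes every start tag has a matching end tag
        self.depth = max(self.depth - 1, 0)

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only data such as '\n'
            indent = "  " * self.depth
            self.lines.append(f"{indent}{text!r}")


outline = OutlinePrinter()
outline.feed("<html>\n<body>\n<h1>Learn anything!</h1>\n</body>\n</html>")
print("\n".join(outline.lines))
```

For the input shown, this prints html, body, and h1 at increasing indentation, followed by the heading text one level deeper.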
Conclusion
Python's HTMLParser gives you an event-driven way to parse HTML: subclass it and override only the handler methods you need. Use it to extract data, analyze structure, or drive transformations of HTML documents programmatically.
