HyperText Markup Language support in Python?

Python can process HTML through the HTMLParser class in the standard-library html.parser module. As it reads a document, the parser identifies each tag and its position (line number and column offset), and it passes the text between tags to handler methods so you can extract it.

The HTMLParser class allows you to create custom parser classes that can process only the tags and data that you define. You can handle start tags, end tags, and text data between tags.

Basic HTML File Structure

Let's start with a simple HTML file that we'll parse:

<html>
<head>
<title>Welcome to Tutorials Point!</title>
</head>
<body>
<h1>Learn anything!</h1>
</body>
</html>

Creating a Custom HTML Parser

Below is a program that creates a custom parser to process HTML content:

from html.parser import HTMLParser

class CustomParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Line and Offset ==", self.getpos())
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Line and Offset ==", self.getpos())
        print("Encountered an end tag:", tag)

    def handle_data(self, data):
        print("Line and Offset ==", self.getpos())
        print("Encountered some data:", repr(data))

# Create parser instance and process HTML
parser = CustomParser()

html_content = """<html>
<head>
<title>Welcome to Tutorials Point!</title>
</head>
<body>
<h1>Learn anything!</h1>
</body>
</html>"""

parser.feed(html_content)
parser.close()  # process any data still held in the parser's buffer

The output of the above code is:

Line and Offset == (1, 0)
Encountered a start tag: html
Line and Offset == (1, 6)
Encountered some data: '\n'
Line and Offset == (2, 0)
Encountered a start tag: head
Line and Offset == (2, 6)
Encountered some data: '\n'
Line and Offset == (3, 0)
Encountered a start tag: title
Line and Offset == (3, 7)
Encountered some data: 'Welcome to Tutorials Point!'
Line and Offset == (3, 34)
Encountered an end tag: title
Line and Offset == (3, 42)
Encountered some data: '\n'
Line and Offset == (4, 0)
Encountered an end tag: head
Line and Offset == (4, 7)
Encountered some data: '\n'
Line and Offset == (5, 0)
Encountered a start tag: body
Line and Offset == (5, 6)
Encountered some data: '\n'
Line and Offset == (6, 0)
Encountered a start tag: h1
Line and Offset == (6, 4)
Encountered some data: 'Learn anything!'
Line and Offset == (6, 19)
Encountered an end tag: h1
Line and Offset == (6, 24)
Encountered some data: '\n'
Line and Offset == (7, 0)
Encountered an end tag: body
Line and Offset == (7, 7)
Encountered some data: '\n'
Line and Offset == (8, 0)
Encountered an end tag: html
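
Much of the output above consists of newline-only data events, because the whitespace between tags is reported as text too. The following is a minimal sketch of one way to skip them, assuming you only care about non-blank text. The class name QuietParser and its events list are illustrative names, not part of the html.parser module.

```python
from html.parser import HTMLParser

class QuietParser(HTMLParser):
    """Records tag and text events, ignoring whitespace-only text (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.events = []  # (kind, value, (line, offset)) tuples

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, self.getpos()))

    def handle_endtag(self, tag):
        self.events.append(("end", tag, self.getpos()))

    def handle_data(self, data):
        # Whitespace between tags arrives here as well; keep only real text.
        if data.strip():
            self.events.append(("data", data.strip(), self.getpos()))

parser = QuietParser()
parser.feed("<body>\n<h1>Learn anything!</h1>\n</body>")
parser.close()

for event in parser.events:
    print(event)
```

Collecting events in a list instead of printing them directly also makes the parser easier to reuse and test.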

Extracting Specific Content

You can also create a parser that extracts specific content, such as the title and headings:

from html.parser import HTMLParser

class ContentExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.current_tag = None
        self.title = ""
        self.headings = []

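    def handle_starttag(self, tag, attrs):
        # Remember the tag we just entered so handle_data can classify the text
        self.current_tag = tag

    def handle_endtag(self, tag):
        # Leaving a tag: text that follows no longer belongs to it
        self.current_tag = None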

    def handle_data(self, data):
        if self.current_tag == "title":
            self.title = data.strip()
        elif self.current_tag in ["h1", "h2", "h3"]:
            self.headings.append(data.strip())

# Parse HTML and extract content
extractor = ContentExtractor()
html_content = """<html>
<head>
<title>Python HTML Parsing</title>
</head>
<body>
<h1>Main Heading</h1>
<h2>Sub Heading</h2>
</body>
</html>"""

extractor.feed(html_content)
extractor.close()  # process any data still held in the parser's buffer

print("Title:", extractor.title)
print("Headings:", extractor.headings)

The output of the above code is:

Title: Python HTML Parsing
Headings: ['Main Heading', 'Sub Heading']
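
ContentExtractor tracks only the most recent tag, so a heading that contains inline markup such as <em> would lose part of its text. The sketch below shows one possible way around that, assuming headings are not nested inside each other: it buffers every text chunk until the heading's own end tag arrives. The names HeadingCollector and HEADING_TAGS are made up for this example.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3"}

class HeadingCollector(HTMLParser):
    """Collects the full text of headings, including text inside nested inline tags."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._open_heading = None  # name of the heading tag currently open, if any
        self._chunks = []          # text pieces seen inside that heading

    def handle_starttag(self, tag, attrs):
        if self._open_heading is None and tag in HEADING_TAGS:
            self._open_heading = tag
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == self._open_heading:
            # Join the pieces and collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._chunks).split())
            self.headings.append(text)
            self._open_heading = None

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate rather than assign.
        if self._open_heading is not None:
            self._chunks.append(data)

collector = HeadingCollector()
collector.feed("<h1>Learn <em>anything</em>, anytime!</h1><h2>Sub\n Heading</h2>")
collector.close()
print(collector.headings)  # ['Learn anything, anytime!', 'Sub Heading']
```

Accumulating chunks is also the safer habit in general, since handle_data is not guaranteed to deliver a run of text in a single call.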

Key Methods

Method	Purpose	Parameters
`handle_starttag()`	Process opening tags	tag, attrs (list of (name, value) pairs)
`handle_endtag()`	Process closing tags	tag
`handle_data()`	Process text content	data
`getpos()`	Get line and offset	None
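
The examples so far ignore the attrs argument of handle_starttag(). As a rough illustration of how it can be used, the sketch below collects the href value of every <a> tag. The LinkCollector class and the sample markup are assumptions made for this example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href values from anchor tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute
        # names are passed in lowercase.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

collector = LinkCollector()
collector.feed(
    '<p><a href="https://example.com/a">A</a> '
    '<a name="x">no href</a> '
    '<A HREF="/b">B</A></p>'
)
collector.close()
print(collector.links)  # ['https://example.com/a', '/b']
```

Converting attrs to a dict is convenient for lookups; if an attribute could appear more than once on a tag, iterate over the list instead.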

Conclusion

Python's HTMLParser parses HTML as a stream of events: subclass it and override the handler methods for the tags and text you care about. Use it to extract data, analyze document structure, or transform HTML programmatically.

Pradeep Elance

Updated on: 2026-03-15T18:35:29+05:30