HyperText Markup Language (HTML) Support in Python

Python can process HTML through the HTMLParser class in the html.parser module of the standard library. As it reads a document, the parser reports each tag it encounters, the tag's attributes, its position (line number and offset), and the text data between tags.

HTMLParser is meant to be subclassed: you override handler methods for the events you care about, namely start tags, end tags, and the text data between tags. Handlers you do not override do nothing by default, so your parser processes only the tags and data that you define.

Basic HTML File Structure

Let's start with a simple HTML file that we'll parse:

<html>
<head>
<title>Welcome to Tutorials Point!</title>
</head>
<body>
<h1>Learn anything!</h1>
</body>
</html>

Creating a Custom HTML Parser

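Below is a program that creates a custom parser to process HTML content. Each handler prints the current position, as returned by getpos(), followed by the tag name or text it received: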

from html.parser import HTMLParser

class CustomParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Line and Offset ==", self.getpos())
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Line and Offset ==", self.getpos())
        print("Encountered an end tag:", tag)

    def handle_data(self, data):
        print("Line and Offset ==", self.getpos())
        print("Encountered some data:", repr(data))

# Create parser instance and process HTML
parser = CustomParser()

html_content = """<html>
<head>
<title>Welcome to Tutorials Point!</title>
</head>
<body>
<h1>Learn anything!</h1>
</body>
</html>"""

parser.feed(html_content)

The output of the above code is:

Line and Offset == (1, 0)
Encountered a start tag: html
Line and Offset == (1, 6)
Encountered some data: '\n'
Line and Offset == (2, 0)
Encountered a start tag: head
Line and Offset == (2, 6)
Encountered some data: '\n'
Line and Offset == (3, 0)
Encountered a start tag: title
Line and Offset == (3, 7)
Encountered some data: 'Welcome to Tutorials Point!'
Line and Offset == (3, 34)
Encountered an end tag: title
Line and Offset == (3, 42)
Encountered some data: '\n'
Line and Offset == (4, 0)
Encountered an end tag: head
Line and Offset == (4, 7)
Encountered some data: '\n'
Line and Offset == (5, 0)
Encountered a start tag: body
Line and Offset == (5, 6)
Encountered some data: '\n'
Line and Offset == (6, 0)
Encountered a start tag: h1
Line and Offset == (6, 4)
Encountered some data: 'Learn anything!'
Line and Offset == (6, 19)
Encountered an end tag: h1
Line and Offset == (6, 24)
Encountered some data: '\n'
Line and Offset == (7, 0)
Encountered an end tag: body
Line and Offset == (7, 7)
Encountered some data: '\n'
Line and Offset == (8, 0)
Encountered an end tag: html
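
The handle_starttag() method above receives attrs but never looks at it. As a minimal sketch of how attributes could be used, the example below collects the href value of every <a> tag. attrs arrives as a list of (name, value) pairs with the names converted to lowercase. LinkCollector and the sample markup are illustrative names, not part of the original program.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: gather href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples; names are lowercased
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


collector = LinkCollector()
collector.feed('<p><a href="/python">Python</a> and '
               '<a HREF="/html" class="ref">HTML</a></p>')
collector.close()

print(collector.links)  # ['/python', '/html']
```

A value of None means the attribute was written without a value (for example, <input disabled>), which is why the check skips it.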

Extracting Specific Content

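You can also write a parser that extracts specific content, such as the title and headings. The parser below remembers the most recently opened tag and stores any text that arrives while that tag is a title or heading: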

from html.parser import HTMLParser

class ContentExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.current_tag = None
        self.title = ""
        self.headings = []

    def handle_starttag(self, tag, attrs):
        self.current_tag = tag

    def handle_endtag(self, tag):
        self.current_tag = None

    def handle_data(self, data):
        # Text is attributed to the most recently opened tag
        if self.current_tag == "title":
            self.title = data.strip()
        elif self.current_tag in ["h1", "h2", "h3"]:
            self.headings.append(data.strip())

# Parse HTML and extract content
extractor = ContentExtractor()
html_content = """<html>
<head>
<title>Python HTML Parsing</title>
</head>
<body>
<h1>Main Heading</h1>
<h2>Sub Heading</h2>
</body>
</html>"""

extractor.feed(html_content)

print("Title:", extractor.title)
print("Headings:", extractor.headings)

The output of the above code is:

Title: Python HTML Parsing
Headings: ['Main Heading', 'Sub Heading']
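
The extractor above tracks only the most recently opened tag, so a heading that contains inline markup, such as <h1>Learn <em>anything</em>!</h1>, would be captured only up to the nested tag. One possible way around this, sketched below under the assumption that headings are not nested inside each other, is to remember which heading is open and accumulate every text fragment until its end tag arrives. HeadingExtractor, HEADING_TAGS, and the sample markup are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3"}


class HeadingExtractor(HTMLParser):
    """Hypothetical variant that keeps text from tags nested in a heading."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._open_heading = None  # name of the heading being read, if any
        self._parts = []           # text fragments seen inside that heading

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS and self._open_heading is None:
            self._open_heading = tag
            self._parts = []

    def handle_endtag(self, tag):
        if tag == self._open_heading:
            # Join the fragments and collapse runs of whitespace
            text = " ".join("".join(self._parts).split())
            self.headings.append(text)
            self._open_heading = None

    def handle_data(self, data):
        if self._open_heading is not None:
            self._parts.append(data)


extractor = HeadingExtractor()
extractor.feed("<h1>Learn <em>anything</em>!</h1>"
               "<p>Body text</p><h2>Sub Heading</h2>")
extractor.close()

print(extractor.headings)  # ['Learn anything!', 'Sub Heading']
```

Because the end tags of nested elements such as </em> do not match the open heading, they leave the state untouched and the text after them is still collected.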

Key Methods

| Method | Purpose | Parameters |
|---|---|---|
| handle_starttag() | Process opening tags | tag, attrs |
| handle_endtag() | Process closing tags | tag |
| handle_data() | Process text content | data |
| getpos() | Return the current line number and offset | None |
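
The programs in this article call feed() once with the whole document. feed() can also be called repeatedly as input arrives; incomplete markup is buffered until more data is fed or close() is called. The sketch below assumes the document arrives in three arbitrary pieces. TextCollector and the chunk boundaries are made up for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: record every piece of text data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # A single run of text may be delivered in more than one call
        self.chunks.append(data)


collector = TextCollector()
for piece in ["<p>Hel", "lo, wor", "ld</p>"]:
    collector.feed(piece)   # pieces need not align with tag boundaries
collector.close()           # process anything still buffered

text = "".join(collector.chunks)
print(text)  # Hello, world
```

Since handle_data() may be called several times for one run of text, joining the pieces afterwards is safer than assuming each call carries a complete string.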

Conclusion

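Python's HTMLParser provides an event-driven way to parse HTML: you override handler methods, and the parser calls them as it encounters tags and text. It does not build a document tree, so any state you need must be tracked in your subclass. Use it to extract data, analyze structure, or transform HTML documents programmatically.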

Updated on: 2026-03-15T18:35:29+05:30
