html.parser — Simple HTML and XHTML parser in Python

The html.parser module in Python's standard library provides the HTMLParser class for parsing HTML and XHTML documents. This class contains handler methods that can identify tags, data, comments, and other HTML elements.

To use HTMLParser, create a subclass that inherits from HTMLParser and override specific handler methods to process different HTML elements.

Basic HTMLParser Setup

Here's the basic structure for creating a custom HTML parser. Because this subclass overrides no handler methods, feeding it markup produces no output:

from html.parser import HTMLParser

class MyParser(HTMLParser):
    pass

parser = MyParser()
parser.feed('<a href="https://www.tutorialspoint.com"></a>')

Handling Start Tags

The handle_starttag(tag, attrs) method is called when the parser encounters an opening HTML tag. The tag name is converted to lowercase, and the attributes are passed as a list of (name, value) tuples, with attribute names lowercased and the quotes around values removed:

from html.parser import HTMLParser

class MyParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print("  attr:", attr)

parser = MyParser()
parser.feed('<a href="https://www.tutorialspoint.com" class="link">')

Output:

Start tag: a
  attr: ('href', 'https://www.tutorialspoint.com')
  attr: ('class', 'link')
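
In practice a handler usually stores what it finds instead of printing it. The sketch below is one possible way to do that, not part of the original example: the LinkCollector class and its links attribute are hypothetical names chosen for illustration. It converts attrs to a dictionary so a single attribute can be looked up by name.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []  # illustrative attribute, not part of HTMLParser

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) tuples; dict() allows lookup by name.
        href = dict(attrs).get("href")
        if href is not None:
            self.links.append(href)


collector = LinkCollector()
collector.feed(
    '<p><a href="https://www.tutorialspoint.com">Home</a> '
    '<A HREF="/python">Python</A> <a name="top">no link</a></p>'
)
print(collector.links)
```

Because tag and attribute names arrive in lowercase, the uppercase <A HREF="..."> link is collected as well, while the anchor without an href is skipped.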

Handling End Tags and Data

Override handle_endtag(tag) for closing tags and handle_data(data) for text content between tags:

from html.parser import HTMLParser

class MyParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
    
    def handle_endtag(self, tag):
        print("End tag:", tag)
    
    def handle_data(self, data):
        if data.strip():  # Only print non-empty data
            print("Data:", data.strip())

parser = MyParser()
html = '''
<html>
   <body>
      <h1>TutorialsPoint</h1>
      <b>Python Standard Library</b>
      <p>HTML Parser Module</p>
   </body>
</html>'''

parser.feed(html)

Output:

Start tag: html
Start tag: body
Start tag: h1
Data: TutorialsPoint
End tag: h1
Start tag: b
Data: Python Standard Library
End tag: b
Start tag: p
Data: HTML Parser Module
End tag: p
End tag: body
End tag: html
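
The three handlers are often combined with a small piece of state so that only the text inside a particular element is kept. The following is a minimal sketch of that idea, assuming we only care about <h1> elements; HeadingExtractor, in_heading, and headings are hypothetical names, not part of the module.

```python
from html.parser import HTMLParser


class HeadingExtractor(HTMLParser):
    """Hypothetical helper: collects the text found inside <h1> elements."""

    def __init__(self):
        super().__init__()
        self.in_heading = False  # True while the parser is inside an <h1>
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_heading = True
            self.headings.append("")  # start a new, empty heading

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_heading = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so append to the current heading.
        if self.in_heading:
            self.headings[-1] += data


extractor = HeadingExtractor()
extractor.feed(
    "<html><body>"
    "<h1>TutorialsPoint</h1>"
    "<p>HTML Parser Module</p>"
    "<h1>Python <b>Standard</b> Library</h1>"
    "</body></html>"
)
print(extractor.headings)
```

Appending to the last entry, instead of assuming one handle_data() call per element, keeps the heading intact when it contains nested tags such as <b>.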

Additional HTMLParser Methods

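The HTMLParser class also provides handle_comment() for comments, handle_startendtag() for self-closing tags such as <img ... />, and getpos(), which returns the current line number and offset: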

from html.parser import HTMLParser

class AdvancedParser(HTMLParser):
    def handle_comment(self, data):
        print("Comment:", data)
    
    def handle_startendtag(self, tag, attrs):
        print("Self-closing tag:", tag)
    
    def get_current_position(self):
        line, offset = self.getpos()
        print(f"Current position: line {line}, offset {offset}")

parser = AdvancedParser()

# Parse HTML with comments and self-closing tags
html_content = '''
<!-- This is a comment -->
<img src="image.jpg" alt="Sample" />
<p>Some text</p>
'''

parser.feed(html_content)
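
Output:

Comment:  This is a comment 
Self-closing tag: img

The <p> element and its text produce no output here, because AdvancedParser does not override handle_starttag(), handle_endtag(), or handle_data(), and the default implementations of those handlers do nothing.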
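
getpos() is most useful when it is called from inside a handler, where it reports where the parser currently is in the input. The sketch below records the position of each start tag; PositionParser and its positions list are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser


class PositionParser(HTMLParser):
    """Hypothetical helper: records where each start tag was encountered."""

    def __init__(self):
        super().__init__()
        self.positions = []  # (tag, line, offset) tuples

    def handle_starttag(self, tag, attrs):
        # getpos() returns a (line number, offset) pair.
        line, offset = self.getpos()
        self.positions.append((tag, line, offset))


position_parser = PositionParser()
position_parser.feed("<p>\n<b>Some text</b>\n</p>")

for tag, line, offset in position_parser.positions:
    print(f"<{tag}> found at line {line}, offset {offset}")
```

Here the <p> tag is reported on line 1 and the <b> tag on line 2, which can help when reporting problems in a large document.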

Method Summary

Method                  Purpose                         Parameters
handle_starttag()       Process opening tags            tag, attrs
handle_endtag()         Process closing tags            tag
handle_data()           Process text content            data
handle_comment()        Process HTML comments           data
handle_startendtag()    Process self-closing tags       tag, attrs
getpos()                Get current parsing position    None
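
One detail the summary does not show: when handle_startendtag() is not overridden, its default implementation calls handle_starttag() followed by handle_endtag(). The sketch below illustrates this with a hypothetical EventLogger class that records events in a list instead of printing them.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: logs start and end events without overriding
    handle_startendtag()."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = EventLogger()
# <br> has no closing slash; <img ... /> is written as a self-closing tag.
logger.feed('<br><img src="image.jpg" alt="Sample" />')
print(logger.events)
```

The self-closing <img ... /> yields both a start and an end event, while the plain <br> yields only a start event, so code that tracks nesting depth should not assume that every start tag is matched by an end tag.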

Conclusion

HTMLParser provides a simple way to parse HTML documents by overriding handler methods in a custom subclass. Use handle_starttag(), handle_endtag(), and handle_data() for basic parsing, and additional methods for comments and self-closing tags.

Updated on: 2026-03-25T05:45:36+05:30
