html.parser — Simple HTML and XHTML parser in Python

The html.parser module in Python's standard library provides the HTMLParser class for parsing HTML and XHTML documents. This class contains handler methods that can identify tags, data, comments, and other HTML elements.

To use HTMLParser, create a subclass that inherits from HTMLParser and override specific handler methods to process different HTML elements.

Basic HTMLParser Setup

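Here's the basic structure for creating a custom HTML parser. The handler methods inherited from HTMLParser do nothing, so this parser runs without producing any output: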

from html.parser import HTMLParser

class MyParser(HTMLParser):
    pass

parser = MyParser()
parser.feed('<a href="https://www.tutorialspoint.com"></a>')
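feed() can be called repeatedly with successive chunks of a document, and incomplete markup is buffered until the rest arrives. A minimal sketch of that behavior, assuming a hypothetical TagRecorder subclass and an illustrative example.com URL (neither comes from the original article):

```python
from html.parser import HTMLParser

class TagRecorder(HTMLParser):
    """Hypothetical helper: remembers every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, attrs))

recorder = TagRecorder()
recorder.feed('<a hre')                    # incomplete tag: buffered, no handler call yet
tags_after_first_chunk = list(recorder.tags)
recorder.feed('f="https://example.com">')  # the tag is now complete
recorder.close()                           # process any input still held in the buffer

print(tags_after_first_chunk)
print(recorder.tags)
```

Calling close() once all input has been fed tells the parser that no more data is coming, which matters when a document arrives in pieces, for example from a network stream.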

Handling Start Tags

The handle_starttag(tag, attrs) method is called when the parser encounters an opening HTML tag. The tag name is converted to lowercase, and the attributes are passed as a list of (name, value) tuples, with attribute names lowercased and the surrounding quotes removed from values:

from html.parser import HTMLParser

class MyParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print("  attr:", attr)

parser = MyParser()
parser.feed('<a href="https://www.tutorialspoint.com" class="link">')

Start tag: a
  attr: ('href', 'https://www.tutorialspoint.com')
  attr: ('class', 'link')
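Since attrs is a list of (name, value) pairs, it is often convenient to turn it into a dictionary before looking up a specific attribute. A minimal sketch, assuming a hypothetical LinkCollector class and made-up input markup, that gathers the href of every anchor tag:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)        # [(name, value), ...] -> {name: value}
        href = attr_map.get("href")   # None if the attribute is missing
        if href is not None:
            self.links.append(href)

collector = LinkCollector()
collector.feed(
    '<A HREF="https://example.com/a">one</A> '
    '<a name="top">two</a> '
    '<a href="/b">three</a>'
)
print(collector.links)
```

The uppercase A and HREF in the input still match, because the parser lowercases tag and attribute names before calling the handler.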

Handling End Tags and Data

Override handle_endtag(tag) for closing tags and handle_data(data) for text content between tags:

from html.parser import HTMLParser

class MyParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
    
    def handle_endtag(self, tag):
        print("End tag:", tag)
    
    def handle_data(self, data):
        if data.strip():  # Only print non-empty data
            print("Data:", data.strip())

parser = MyParser()
html = '''
<html>
   <body>
      <h1>TutorialsPoint</h1>
      <b>Python Standard Library</b>
      <p>HTML Parser Module</p>
   </body>
</html>'''

parser.feed(html)

Start tag: html
Start tag: body
Start tag: h1
Data: TutorialsPoint
End tag: h1
Start tag: b
Data: Python Standard Library
End tag: b
Start tag: p
Data: HTML Parser Module
End tag: p
End tag: body
End tag: html
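The three handlers are usually combined with a little state, so that handle_data() knows which element the text belongs to. A minimal sketch, assuming a hypothetical HeadingCollector class, that keeps only the text found inside h1 elements:

```python
from html.parser import HTMLParser

class HeadingCollector(HTMLParser):
    """Hypothetical helper: collects the text of every <h1> element."""

    def __init__(self):
        super().__init__()
        self.in_heading = False   # True while we are between <h1> and </h1>
        self.parts = []           # text fragments of the current heading
        self.headings = []        # finished headings

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_heading = True
            self.parts = []

    def handle_data(self, data):
        # Text may arrive in several pieces (e.g. around nested tags),
        # so fragments are accumulated rather than stored directly.
        if self.in_heading:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self.in_heading:
            self.headings.append("".join(self.parts).strip())
            self.in_heading = False

heading_parser = HeadingCollector()
heading_parser.feed(
    "<body><h1>Tutorials<b>Point</b></h1><p>HTML Parser Module</p></body>"
)
print(heading_parser.headings)
```

Joining the fragments at the closing tag keeps the heading intact even when it contains nested markup such as the b element in this input.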

Additional HTMLParser Methods

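The HTMLParser class provides several other useful methods. The example below overrides handle_comment() for HTML comments and handle_startendtag() for XHTML-style empty tags such as <img ... />. It also defines a get_current_position() helper that wraps getpos(), although the helper is not called during parsing: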

from html.parser import HTMLParser

class AdvancedParser(HTMLParser):
    def handle_comment(self, data):
        print("Comment:", data)
    
    def handle_startendtag(self, tag, attrs):
        print("Self-closing tag:", tag)
    
    def get_current_position(self):
        line, offset = self.getpos()
        print(f"Current position: line {line}, offset {offset}")

parser = AdvancedParser()

# Parse HTML with comments and self-closing tags
html_content = '''
<!-- This is a comment -->
<img src="image.jpg" alt="Sample" />
<p>Some text</p>
'''

parser.feed(html_content)

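Comment:  This is a comment 
Self-closing tag: img

The <p> element produces no output here because AdvancedParser does not override handle_starttag(), handle_data(), or handle_endtag(), and the inherited versions do nothing.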
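getpos() is most useful when called from inside a handler, where it reports where the element being handled begins. A minimal sketch, assuming a hypothetical PositionLogger class, that records the line number and column offset of each start tag:

```python
from html.parser import HTMLParser

class PositionLogger(HTMLParser):
    """Hypothetical helper: records where each start tag begins."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        line, offset = self.getpos()   # line is 1-based, offset is 0-based
        self.positions.append((tag, line, offset))

logger = PositionLogger()
logger.feed("<p>one</p>\n  <b>two</b>")
print(logger.positions)
```

Here the p tag is recorded at line 1, offset 0, and the b tag at line 2, offset 2, because it follows two spaces of indentation.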

Method Summary

Method	Purpose	Parameters
`handle_starttag()`	Process opening tags	tag, attrs
`handle_endtag()`	Process closing tags	tag
`handle_data()`	Process text content	data
`handle_comment()`	Process HTML comments	data
`handle_startendtag()`	Process self-closing tags	tag, attrs
`getpos()`	Get current parsing position	None
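The split between handle_starttag() and handle_startendtag() depends on how the tag is written: <br> triggers the former, <br/> the latter. If handle_startendtag() is not overridden, its default implementation calls handle_starttag() followed by handle_endtag(). A minimal sketch comparing the two cases, assuming hypothetical class names:

```python
from html.parser import HTMLParser

class AllHandlers(HTMLParser):
    """Hypothetical parser that overrides all three tag handlers."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startend", tag))

class DefaultStartEnd(HTMLParser):
    """Hypothetical parser that leaves handle_startendtag() at its default."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

markup = "<br><br/>"

explicit = AllHandlers()
explicit.feed(markup)

fallback = DefaultStartEnd()
fallback.feed(markup)

print(explicit.events)
print(fallback.events)
```

In practice this means a parser that only cares about start tags does not need to override handle_startendtag() at all, since empty tags still reach handle_starttag() through the default implementation.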

Conclusion

HTMLParser provides a simple way to parse HTML documents by overriding handler methods in a custom subclass. Use handle_starttag(), handle_endtag(), and handle_data() for basic parsing, and additional methods for comments and self-closing tags.

Vrundesha Joshi

Updated on: 2026-03-25T05:45:36+05:30
