html.parser — Simple HTML and XHTML parser in Python
The html.parser module in Python's standard library provides the HTMLParser class for parsing HTML and XHTML documents. This class contains handler methods that can identify tags, data, comments, and other HTML elements.
To use HTMLParser, subclass it and override the handler methods for the HTML elements you want to process. The base-class handlers do nothing, so only the methods you override have any effect.
Basic HTMLParser Setup
Here's the basic structure for creating a custom HTML parser. Because this subclass overrides no handlers, feeding it HTML produces no output:
from html.parser import HTMLParser

class MyParser(HTMLParser):
    pass

parser = MyParser()
parser.feed('<a href="https://www.tutorialspoint.com"></a>')
Handling Start Tags
The handle_starttag(tag, attrs) method is called when the parser encounters an opening HTML tag. The tag name is converted to lowercase, and the attributes are passed as a list of (name, value) tuples, with attribute names also lowercased and the quotes around values removed:
from html.parser import HTMLParser

class MyParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print(" attr:", attr)

parser = MyParser()
parser.feed('<a href="https://www.tutorialspoint.com" class="link">')
Output:
Start tag: a
 attr: ('href', 'https://www.tutorialspoint.com')
 attr: ('class', 'link')
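Since attrs is a list of (name, value) tuples, it can be passed to dict() when lookup by attribute name is more convenient. The following is a minimal sketch of that idea; the LinkCollector class and its links attribute are hypothetical names chosen for illustration, not part of the html.parser module.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper that gathers href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so "A" and "a" both match here.
        if tag != "a":
            return
        # attrs is a list of (name, value) tuples; dict() allows lookup by name.
        href = dict(attrs).get("href")
        if href is not None:
            self.links.append(href)


collector = LinkCollector()
collector.feed('<p><A HREF="https://www.tutorialspoint.com">Home</A> <a name="top">x</a></p>')
print(collector.links)  # ['https://www.tutorialspoint.com']
```

The second anchor has no href attribute, so it is skipped. Storing results on the instance rather than printing them makes the parser easier to reuse from other code.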
Handling End Tags and Data
Override handle_endtag(tag) to process closing tags and handle_data(data) to process the text content between tags:
from html.parser import HTMLParser

class MyParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)

    def handle_endtag(self, tag):
        print("End tag:", tag)

    def handle_data(self, data):
        if data.strip():  # Only print non-empty data
            print("Data:", data.strip())

parser = MyParser()
html = '''
<html>
<body>
<h1>TutorialsPoint</h1>
<b>Python Standard Library</b>
<p>HTML Parser Module</p>
</body>
</html>'''
parser.feed(html)
Output:
Start tag: html
Start tag: body
Start tag: h1
Data: TutorialsPoint
End tag: h1
Start tag: b
Data: Python Standard Library
End tag: b
Start tag: p
Data: HTML Parser Module
End tag: p
End tag: body
End tag: html
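The text of one element may reach handle_data() in several pieces, for example when inline tags interrupt it, so a common approach is to collect the pieces and join them afterwards. The sketch below assumes that whitespace should be collapsed to single spaces; TextExtractor and its text() method are hypothetical names used only for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper that accumulates the text content of a document."""

    def __init__(self):
        super().__init__()
        self._chunks = []

    def handle_data(self, data):
        # Store every piece of text; whitespace is cleaned up in text().
        self._chunks.append(data)

    def text(self):
        # Join the pieces, then collapse runs of whitespace to single spaces.
        return " ".join("".join(self._chunks).split())


extractor = TextExtractor()
extractor.feed("<h1>Tutorials</h1>\n<p>HTML &amp; <b>Parser</b> Module</p>")
extractor.close()  # Flush any text still buffered by the parser
print(extractor.text())  # Tutorials HTML & Parser Module
```

With the default constructor arguments, character references such as &amp; are already converted before handle_data() is called, which is why the result contains a plain ampersand.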
Additional HTMLParser Methods
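The HTMLParser class provides several other useful methods, such as handle_comment() for comments, handle_startendtag() for XHTML-style self-closing tags, and getpos() for the current line number and offset: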
from html.parser import HTMLParser

class AdvancedParser(HTMLParser):
    def handle_comment(self, data):
        print("Comment:", data)

    def handle_startendtag(self, tag, attrs):
        print("Self-closing tag:", tag)

    def get_current_position(self):
        line, offset = self.getpos()
        print(f"Current position: line {line}, offset {offset}")

parser = AdvancedParser()
# Parse HTML with comments and self-closing tags
html_content = '''
<!-- This is a comment -->
<img src="image.jpg" alt="Sample" />
<p>Some text</p>
'''
parser.feed(html_content)
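Output:
Comment:  This is a comment 
Self-closing tag: img
The comment text keeps its surrounding spaces. The <p> element and its text produce no output here, because this class does not override handle_starttag(), handle_data(), or handle_endtag().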
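When handle_startendtag() is not overridden, its default implementation calls handle_starttag() followed by handle_endtag(), and getpos() can be called inside any handler to find out where the current tag begins. The sketch below illustrates both points; PositionTracker and its events list are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser


class PositionTracker(HTMLParser):
    """Hypothetical helper that records tag events with their positions."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns (line, offset); lines start at 1, offsets at 0.
        line, offset = self.getpos()
        self.events.append(("start", tag, line, offset))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


tracker = PositionTracker()
tracker.feed('<p>\n  <img src="image.jpg" />\n</p>')
for event in tracker.events:
    print(event)
# ('start', 'p', 1, 0)
# ('start', 'img', 2, 2)
# ('end', 'img')
# ('end', 'p')
```

The self-closing <img /> tag yields both a start and an end event because handle_startendtag() was left at its default.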
Method Summary
| Method | Purpose | Parameters |
|---|---|---|
| handle_starttag() | Process opening tags | tag, attrs |
| handle_endtag() | Process closing tags | tag |
| handle_data() | Process text content | data |
| handle_comment() | Process HTML comments | data |
| handle_startendtag() | Process self-closing tags | tag, attrs |
| getpos() | Get current parsing position | None |
Conclusion
HTMLParser provides a simple way to parse HTML documents by overriding handler methods in a custom subclass. Use handle_starttag(), handle_endtag(), and handle_data() for basic parsing, and additional methods for comments and self-closing tags.
