html.parser — Simple HTML and XHTML parser in Python


The HTMLParser class defined in this module provides functionality to parse HTML and XHTML documents. It contains handler methods that are called when the parser encounters tags, data, comments and other HTML elements.

To use it, we define a new class that inherits from HTMLParser and pass HTML text to an instance of it using the feed() method.

from html.parser import HTMLParser

class parser(HTMLParser):
    pass

p = parser()
p.feed('<a href = "www.tutorialspoint.com"></a>')
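
The feed() method can be called more than once, so a document may be supplied in pieces; markup that is still incomplete is buffered until more text arrives or close() is called. The following is a minimal sketch of that behaviour. The class name ChunkParser and its seen list are illustrative names, not part of the module.

```python
from html.parser import HTMLParser

class ChunkParser(HTMLParser):
    """Records the names of the start tags it sees (illustrative helper)."""
    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append(tag)

p = ChunkParser()
p.feed('<a hre')             # incomplete tag: buffered, no handler is called yet
print(p.seen)                # []
p.feed('f="x">link</a>')     # the second chunk completes the tag
p.close()                    # signal that no more data will follow
print(p.seen)                # ['a']
```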

The handler methods do nothing by default, so we override the following methods to act on what the parser finds.

handle_starttag(tag, attrs):

HTML tags normally come in pairs of a start tag and an end tag, for example <head> and </head>. This method is called to handle the start tag.

The tag argument is the name of the tag converted to lower case. The attrs argument is a list of (name, value) pairs for the attributes found inside the tag’s <> brackets.

For instance, the tag <a href = "www.tutorialspoint.com"> is fed to the parser object in the following code.

from html.parser import HTMLParser

class parser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print(" attr:", attr)

p = parser()
p.feed('<a href = "www.tutorialspoint.com">')

Output

Start tag: a
 attr: ('href', 'www.tutorialspoint.com')
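
Because attrs arrives as a list of (name, value) tuples, it is often convenient to turn it into a dictionary before looking up a particular attribute. The sketch below collects the href values of <a> tags this way. LinkCollector and its links list are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href values of <a> tags (illustrative helper)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples; tag and attribute
        # names have already been converted to lower case
        attr_map = dict(attrs)
        if tag == "a" and "href" in attr_map:
            self.links.append(attr_map["href"])

collector = LinkCollector()
collector.feed('<A HREF="www.tutorialspoint.com">home</A> <a name="top"></a>')
print(collector.links)  # ['www.tutorialspoint.com']
```

The second <a> tag has no href attribute, so it is skipped.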

handle_endtag(tag):

This method is called to handle the end tag of an element.

def handle_endtag(self, tag):
    print("end tag", tag)
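
Pairing handle_endtag() with handle_starttag() makes it possible to keep track of which elements are currently open. The following is a rough sketch under a simplifying assumption: every start tag has a matching end tag, so void elements such as <br> are not handled. DepthTracker, stack and max_depth are illustrative names.

```python
from html.parser import HTMLParser

class DepthTracker(HTMLParser):
    """Tracks the currently open tags and the deepest nesting seen."""
    def __init__(self):
        super().__init__()
        self.stack = []       # names of the currently open tags
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.max_depth = max(self.max_depth, len(self.stack))

    def handle_endtag(self, tag):
        # Pop only when the end tag matches the innermost open tag
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

tracker = DepthTracker()
tracker.feed('<div><p><b>text</b></p></div>')
print(tracker.max_depth, tracker.stack)  # 3 []
```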

handle_data(data):

This method is called to process arbitrary data between tags. For example:

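from html.parser import HTMLParser

class parser(HTMLParser):
    # the two handlers defined earlier, repeated so the example runs on its own
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print(" attr:", attr)
    def handle_endtag(self, tag):
        print("end tag", tag)
    def handle_data(self, data):
        print(data)

p = parser()
html = '''
<html>
   <body>
      <h1>Tutorialspoint</h1>
      <b>Python standard library</b>
      <p>HTML module</p>
   </body>
</html>'''
p.feed(html)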

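Output (abridged: the lines printed for the <html> and <body> tags and for whitespace-only data are omitted)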

Start tag: h1
Tutorialspoint
end tag h1
Start tag: b
Python standard library
end tag b
Start tag: p
HTML module
end tag p
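
The three handlers are commonly combined to pull out the text of one particular element. The sketch below gathers the text of <h1> elements and ignores whitespace-only data, which the parser also passes to handle_data(). HeadingExtractor, in_h1 and headings are hypothetical names used for illustration only.

```python
from html.parser import HTMLParser

class HeadingExtractor(HTMLParser):
    """Collects the text found inside <h1> elements (illustrative helper)."""
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        text = data.strip()
        # keep the text only while inside an <h1>, and skip blank runs
        if self.in_h1 and text:
            self.headings.append(text)

extractor = HeadingExtractor()
extractor.feed('<body>\n<h1>Tutorialspoint</h1>\n<p>HTML module</p>\n</body>')
print(extractor.headings)  # ['Tutorialspoint']
```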

Other methods of the HTMLParser class include the following:

get_starttag_text()

Return the text of the most recently opened start tag.
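
As a small illustration, get_starttag_text() can be called from inside handle_starttag() to recover the tag exactly as it was written in the source, while the tag argument is lower-cased. RawTagParser and raw are illustrative names.

```python
from html.parser import HTMLParser

class RawTagParser(HTMLParser):
    """Stores each tag name together with its original source text."""
    def __init__(self):
        super().__init__()
        self.raw = []

    def handle_starttag(self, tag, attrs):
        # tag is lower-cased; get_starttag_text() keeps the original spelling
        self.raw.append((tag, self.get_starttag_text()))

p = RawTagParser()
p.feed('<A HREF="index.html">')
print(p.raw)  # [('a', '<A HREF="index.html">')]
```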

getpos()

Return the current line number and offset.
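
A short sketch of how getpos() might be used to report where each start tag begins. The returned value is a (line number, offset) tuple, with lines counted from 1 and offsets from 0. PositionParser and positions are illustrative names.

```python
from html.parser import HTMLParser

class PositionParser(HTMLParser):
    """Records the position at which each start tag begins."""
    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns a (line number, offset) tuple
        self.positions.append((tag, self.getpos()))

p = PositionParser()
p.feed('<html>\n  <body>')
print(p.positions)  # [('html', (1, 0)), ('body', (2, 2))]
```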

handle_startendtag(tag, attrs)

Similar to handle_starttag(), but called when the parser encounters an XHTML-style empty tag (<img ... />).
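
If handle_startendtag() is not overridden, the default implementation calls handle_starttag() followed by handle_endtag(). The sketch below overrides it to tell XHTML-style empty tags apart from ordinary start tags. EmptyTagParser and events are hypothetical names.

```python
from html.parser import HTMLParser

class EmptyTagParser(HTMLParser):
    """Distinguishes <tag ... /> from ordinary start tags."""
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_startendtag(self, tag, attrs):
        # called only for tags written with a trailing "/>"
        self.events.append(("startend", tag))

p = EmptyTagParser()
p.feed('<p><img src="logo.png" /><br>')
print(p.events)  # [('start', 'p'), ('startend', 'img'), ('start', 'br')]
```

Note that <br> written without the trailing slash is reported through handle_starttag().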

handle_comment(data)

This method is called when a comment is encountered (e.g. <!--comment-->).
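
For example, a parser might gather every comment in a document. The data argument holds the text between <!-- and -->, without the delimiters. CommentCollector and comments are illustrative names.

```python
from html.parser import HTMLParser

class CommentCollector(HTMLParser):
    """Collects the text of every comment in the input."""
    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data excludes the '<!--' and '-->' delimiters
        self.comments.append(data)

p = CommentCollector()
p.feed('<p>text</p><!-- build 42 --><!--comment-->')
print(p.comments)  # [' build 42 ', 'comment']
```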

Updated on: 30-Jul-2019
