html5lib and lxml parsers in Python


html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, which all major web browsers implement, so it parses markup the same way browsers do. It breaks an HTML document down into tags and other pieces that can be filtered for various use cases. It can also handle broken HTML, adding the tags needed to complete the document structure.

lxml is a similar parser, but it is driven by XML features rather than HTML. It depends on external C libraries (libxml2 and libxslt), which makes it faster than html5lib.
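Because lxml needs compiled C libraries while html5lib is pure Python, a script may not know in advance which parser is installed. The following is a minimal sketch of one way to choose a parser name at runtime. The helper name pick_parser and its preference order are assumptions made for illustration, not part of either library; "html.parser" is the parser name BeautifulSoup accepts for Python's built-in HTML parser.

```python
from importlib.util import find_spec


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first installed parser name, else the built-in fallback.

    pick_parser is a hypothetical helper; the preference order is an assumption.
    """
    for name in preferred:
        # find_spec returns None when a top-level package is not installed
        if find_spec(name) is not None:
            return name
    # Python's standard-library parser needs no extra installation
    return "html.parser"


print(pick_parser())
```

The returned name can then be passed as the second argument to BeautifulSoup, for example BeautifulSoup(markup, pick_parser()).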

Let's observe the difference in behavior between these two parsers by parsing the same malformed markup with each and comparing the output.

Example

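from bs4 import BeautifulSoup
# Malformed input: unclosed <head> and <li>, plus a stray </p> end tag
markup = "<head><li></p>"
# Parse with html5lib, which follows the browser parsing rules
html5_structure = BeautifulSoup(markup, "html5lib")
print(html5_structure)
# Parse the same markup with lxml
lxml_structure = BeautifulSoup(markup, "lxml")
print(lxml_structure)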

Running the above code gives us the following result:

Output

<html><head></head><body><li><p></p></li></body></html>
<html><head></head><body><li></li></body></html>

As we can see, html5lib creates a more complete HTML document by turning the stray </p> end tag into an empty <p></p> element. The lxml parser is more focused on XML-like structure and ignores the stray </p> tag completely.

Updated on: 20-Dec-2019
