html5lib and lxml parsers in Python


html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, which all major web browsers implement, so it parses markup the same way a browser does. It breaks an HTML document down into tags and text pieces that can be filtered for various use cases. It can also handle broken HTML and adds the tags needed to complete the document structure.

lxml is a similar parser, but it is oriented more towards XML than HTML. It depends on external C libraries and is faster than html5lib.
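Because lxml needs compiled C libraries while html5lib is pure Python, one or both may be missing on a given machine. The following is a minimal sketch, not part of the original example, of how a script might choose a parser name to pass to BeautifulSoup. The helper name pick_parser and its prefer_speed flag are hypothetical, and the fallback "html.parser" is the parser that ships with Python's standard library.

```python
from importlib.util import find_spec


def pick_parser(prefer_speed=True):
    """Return a parser name suitable for BeautifulSoup(markup, name).

    Hypothetical helper: prefers lxml when speed matters and html5lib
    when browser-like parsing matters, using whichever is installed.
    """
    order = ("lxml", "html5lib") if prefer_speed else ("html5lib", "lxml")
    for name in order:
        # find_spec returns None when the package cannot be imported
        if find_spec(name) is not None:
            return name
    # Neither third-party parser is installed: fall back to the stdlib one
    return "html.parser"


print(pick_parser())                    # fastest available parser
print(pick_parser(prefer_speed=False))  # most browser-like available parser
```

The returned string can be used directly as the second argument to BeautifulSoup.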

Let's observe the difference in behavior between these two parsers by giving both the same malformed markup and comparing the output.

Example

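from bs4 import BeautifulSoup

# The same malformed markup is handed to both parsers:
# <li> is never closed and </p> has no matching opening tag.
html5_structure = BeautifulSoup("<head><li></p>", "html5lib")
print(html5_structure)
lxml_structure = BeautifulSoup("<head><li></p>", "lxml")
print(lxml_structure)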

Running the above code gives us the following result:

Output

<html><head></head><body><li><p></p></li></body></html>
<html><head></head><body><li></li></body></html>

As we can see, html5lib creates a more complete HTML document by turning the stray </p> into a <p></p> pair inside the <li>. lxml is more focused on an XML-like structure and ignores the stray tag completely.
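The difference described above can also be checked programmatically rather than by reading the printed markup. This is a rough sketch under a few assumptions: the helper name keeps_stray_p is made up for illustration, and it returns None when bs4 or the requested parser is not installed so that the script still runs.

```python
from importlib.util import find_spec


def keeps_stray_p(markup, parser):
    """Report whether the parsed tree contains a <p> element.

    Hypothetical helper: returns True or False, or None when bs4 or
    the requested parser package is not installed.
    """
    if find_spec("bs4") is None or find_spec(parser) is None:
        return None
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(markup, parser)
    # find() returns None when no matching element exists in the tree
    return soup.find("p") is not None


markup = "<head><li></p>"
results = {name: keeps_stray_p(markup, name) for name in ("html5lib", "lxml")}
for name, found in results.items():
    print(name, found)
```

A True result means the parser built a <p> element from the stray closing tag; False means it dropped the tag.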

raja
Published on 20-Dec-2019 15:48:23