html5lib and lxml parsers in Python
html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, which all major web browsers implement, so it parses markup the same way a browser does. It breaks an HTML document down into its tags and other pieces, which can then be filtered for various use cases. It also copes with broken HTML, adding whatever tags are needed to complete the document structure.
lxml is a similar parser, but it is oriented toward XML rather than HTML. It depends on external C libraries (libxml2 and libxslt), and it is faster than html5lib.
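The speed difference can be checked locally. The following is a minimal sketch, not a rigorous benchmark: the `time_parser` helper and the `SAMPLE_HTML` document are hypothetical names made up for this illustration, and the numbers will vary with the machine and the installed library versions. The helper returns `None` when bs4 or the requested parser is not installed.

```python
import timeit

try:
    from bs4 import BeautifulSoup, FeatureNotFound
except ImportError:  # bs4 is a third-party package and may not be installed
    BeautifulSoup = None

# Hypothetical sample document, built only for this measurement.
SAMPLE_HTML = "<html><body>" + "<p>row <b>x</b></p>" * 500 + "</body></html>"


def time_parser(parser_name, markup=SAMPLE_HTML, repeats=3):
    """Return the best time in seconds to parse markup with parser_name,
    or None if bs4 or that parser is unavailable."""
    if BeautifulSoup is None:
        return None
    try:
        # Probe first: bs4 raises FeatureNotFound for a missing parser.
        BeautifulSoup("<p></p>", parser_name)
    except FeatureNotFound:
        return None
    timings = timeit.repeat(
        lambda: BeautifulSoup(markup, parser_name), number=1, repeat=repeats
    )
    return min(timings)


for name in ("html5lib", "lxml"):
    elapsed = time_parser(name)
    if elapsed is None:
        print(f"{name}: not available")
    else:
        print(f"{name}: {elapsed:.4f} s")
```

Taking the minimum over a few repeats reduces noise from other processes. For a serious comparison, use a realistic document and more repetitions.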
Let's observe the difference in behavior between these two parsers by passing the same malformed markup to each and comparing the output.
from bs4 import BeautifulSoup

html5_structure = BeautifulSoup("<head><li></p>", "html5lib")
print(html5_structure)

lxml_structure = BeautifulSoup("<head><li></p>", "lxml")
print(lxml_structure)
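Running the above code gives us the following result:
<html><head></head><body><li><p></p></li></body></html>
<html><head></head><body><li></li></body></html>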
As we can see, html5lib creates a more complete HTML document by supplying the missing opening <p> tag for the stray </p>. The lxml library is more focused on XML-like structure and ignores the stray </p> tag completely.
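The same observation can be checked programmatically instead of by reading the printed markup. The sketch below assumes a hypothetical helper, `keeps_p_tag`, which is not part of bs4. It parses the markup with the named parser and reports whether the resulting tree contains a <p> element. It returns `None` when bs4 or the requested parser is not installed.

```python
try:
    from bs4 import BeautifulSoup, FeatureNotFound
except ImportError:  # bs4 is a third-party package and may not be installed
    BeautifulSoup = None

MARKUP = "<head><li></p>"  # the same malformed markup used in the article


def keeps_p_tag(parser_name, markup=MARKUP):
    """Return True if the parsed tree contains a <p> element, False if it
    does not, or None if bs4 or the requested parser is unavailable."""
    if BeautifulSoup is None:
        return None
    try:
        soup = BeautifulSoup(markup, parser_name)
    except FeatureNotFound:
        return None
    return soup.find("p") is not None


for name in ("html5lib", "lxml"):
    print(name, "->", keeps_p_tag(name))
```

Following the behavior described above, html5lib should report that a <p> element is present, while lxml should report that it is not. The exact repair may differ between library versions.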