Beautiful Soup - Parsing XML



Beautiful Soup can also parse an XML document. To do so, pass the features='xml' argument to the BeautifulSoup() constructor.

Assuming that we have the following books.xml in the current working directory −

Example

<?xml version="1.0" ?>
<books>
   <book>
      <title>Python</title>
      <author>TutorialsPoint</author>
      <price>400</price>
   </book>
</books> 

The following code parses the given XML file −

from bs4 import BeautifulSoup

# Open the file in a with block so it is closed once parsing is done
with open("books.xml") as fp:
    soup = BeautifulSoup(fp, features="xml")

print(soup)
print('type:', type(soup))

When the above code is executed, you should get the following result −

<?xml version="1.0" encoding="utf-8"?>
<books>
<book>
<title>Python</title>
<author>TutorialsPoint</author>
<price>400</price>
</book>
</books>
type: <class 'bs4.BeautifulSoup'> 
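
Once the document is parsed, individual elements can be read from the tree. The following is a minimal sketch, assuming the same content as books.xml supplied as an inline string (the variable names xml_doc, book, title, author and price are chosen here for illustration) −

```python
from bs4 import BeautifulSoup

# Same content as books.xml, inlined so the sketch runs on its own.
xml_doc = """<?xml version="1.0" ?>
<books>
   <book>
      <title>Python</title>
      <author>TutorialsPoint</author>
      <price>400</price>
   </book>
</books>"""

# The "xml" feature requires the lxml package to be installed.
soup = BeautifulSoup(xml_doc, features="xml")

# find() returns the first matching element; get_text() returns its text.
book = soup.find("book")
title = book.find("title").get_text()
author = book.find("author").get_text()
price = int(book.find("price").get_text())

print(title, author, price)
```

The price is converted with int() because element text is always returned as a string.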

XML parser Error

By default, Beautiful Soup parses documents as HTML. The beautifulsoup4 package can also parse XML, and it is easy to use and handles ill-formed XML gracefully.

To parse a document as XML, you need to have the lxml parser installed, and you pass "lxml-xml" or "xml" as the second argument to the BeautifulSoup constructor −

soup = BeautifulSoup(markup, "lxml-xml")

or

soup = BeautifulSoup(markup, "xml")
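
The choice of parser changes how tag names are handled. The sketch below, which uses a made-up markup string with mixed-case tags, suggests that the HTML parser lowercases tag names while the XML parser keeps them as written, and that "xml" and "lxml-xml" produce the same result −

```python
from bs4 import BeautifulSoup

# Hypothetical markup with mixed-case tag names, for illustration only.
markup = "<Books><Book><Title>Python</Title></Book></Books>"

html_soup = BeautifulSoup(markup, "html.parser")  # parsed as HTML
xml_soup = BeautifulSoup(markup, "xml")           # parsed as XML (needs lxml)
alias_soup = BeautifulSoup(markup, "lxml-xml")    # alternative name for the XML parser

# The HTML parser lowercases tag names, so a search for "Title" finds nothing.
html_has_title = html_soup.find("Title") is not None
# The XML parser preserves case, so the same search succeeds.
xml_has_title = xml_soup.find("Title") is not None
# Both XML parser names serialize the document identically.
same_output = str(xml_soup) == str(alias_soup)

print(html_has_title, xml_has_title, same_output)
```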

One common XML parsing error is −

AttributeError: 'NoneType' object has no attribute 'attrib'

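This can happen when an element is missing or not defined: find() then returns None, and accessing an attribute on that result raises the error. Note that find_all() returns an empty list, not None, when nothing matches.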
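
One way to avoid this kind of error is to check the result of find() before using it. The following is a minimal sketch; the helper text_or_default and the missing isbn element are hypothetical and not part of Beautiful Soup −

```python
from bs4 import BeautifulSoup

xml_doc = "<books><book><title>Python</title></book></books>"
soup = BeautifulSoup(xml_doc, "xml")  # needs the lxml package


def text_or_default(parent, tag_name, default=""):
    """Hypothetical helper: return the text of a child element, or a default."""
    node = parent.find(tag_name)
    # find() returns None when no element matches, so guard before using it.
    return node.get_text() if node is not None else default


book = soup.find("book")
title = text_or_default(book, "title")
isbn = text_or_default(book, "isbn", default="unknown")  # <isbn> is absent

print(title, isbn)
```

Without the None check, calling a method or reading an attribute on the missing isbn element would raise an AttributeError.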
