Beautiful Soup - Error Handling



While parsing an HTML/XML document with Beautiful Soup, you may encounter errors that come not from your script but from the structure of the snippet being parsed, because the BeautifulSoup API raises an exception for it.

By default, the beautifulsoup4 package parses documents as HTML; however, it is very easy to use and handles ill-formed XML in an elegant manner as well.

To parse a document as XML, you need to have the lxml parser installed, and you just need to pass "xml" as the second argument to the BeautifulSoup constructor −

soup = BeautifulSoup(markup, "lxml-xml")

or

soup = BeautifulSoup(markup, "xml")

One common XML parsing error is −

AttributeError: 'NoneType' object has no attribute 'attrib'

This typically happens when an element is missing or not defined, so a call such as find() or findall() returns None, and your code then tries to access an attribute on that None value.
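For illustration, here is a minimal sketch of how that error arises and how to guard against it, using the standard library's xml.etree.ElementTree (whose find() returns None for a missing element); the element names are made up:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<root><child/></root>")

# find() returns None when the element is missing, so calling
# .attrib on the result raises the AttributeError shown above.
missing = root.find("missing")

# Guard with an explicit None check before touching attributes.
if missing is not None:
    print(missing.attrib)
else:
    print("element not found")
```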

Apart from the parsing errors mentioned above, you may encounter environmental issues: your script might work in one operating system but not in another, work in one virtual environment but not in another, or not work outside a virtual environment at all. All these issues may arise because the two environments have different parser libraries available.

It is therefore recommended to know or check the default parser in your current working environment, or to pass the required parser library explicitly as the second argument to the BeautifulSoup constructor.
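As a sketch, you can probe which parser libraries are usable in the current environment by catching bs4's FeatureNotFound exception (the helper name available_parsers is our own, not part of the library):

```python
from bs4 import BeautifulSoup, FeatureNotFound

def available_parsers():
    """Return the parser names that BeautifulSoup can use here."""
    found = []
    for name in ("html.parser", "lxml", "html5lib"):
        try:
            BeautifulSoup("<p>probe</p>", name)
            found.append(name)
        except FeatureNotFound:
            pass  # that parser library is not installed in this environment
    return found

print(available_parsers())  # "html.parser" (stdlib) is always present
```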

Because HTML tags and attributes are case-insensitive, all three HTML parsers convert tag and attribute names to lowercase. If you want to preserve mixed-case or uppercase tags and attributes, it is better to parse the document as XML.
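A small sketch of the lowercasing behaviour with the built-in html.parser (parsing the same markup with the "xml" parser would keep the original case, provided lxml is installed); the markup is made up:

```python
from bs4 import BeautifulSoup

markup = '<MixedCase Attr="x">text</MixedCase>'

# Any HTML parser normalises tag and attribute names to lowercase.
html_soup = BeautifulSoup(markup, "html.parser")
print(html_soup)  # <mixedcase attr="x">text</mixedcase>
```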

UnicodeEncodeError

Let us look at the code segment below −

Example

soup = BeautifulSoup(response, "html.parser")
print(soup)

Output

UnicodeEncodeError: 'charmap' codec can't encode character '\u011f'

The above problem arises in two main situations. First, you might be trying to print a Unicode character that your console doesn't know how to display. Second, you might be writing to a file and passing in a Unicode character that is not supported by your default encoding.

One way to resolve the above problem is to encode the response text before making the soup, as follows −

responseTxt = response.text.encode('UTF-8')
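To see why encoding first helps, here is a minimal sketch with no network call; the literal string stands in for response.text:

```python
# 'ğ' (U+011F) is the character from the traceback above.
text = "da\u011f"

# Encoding to UTF-8 produces bytes that any console or file
# opened in binary mode can handle.
data = text.encode("utf-8")
print(data)                  # b'da\xc4\x9f'
print(data.decode("utf-8"))  # round-trips back to the original string
```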

KeyError: [attr]

It is caused by accessing tag['attr'] when the tag in question does not define the attr attribute. The most common cases are "KeyError: 'href'" and "KeyError: 'class'". Use tag.get('attr') if you are not sure that attr is defined.

for item in soup.find_all('a'):
   try:
      if item['href'].startswith('/') or "tutorialspoint" in item['href']:
         (...)
   except KeyError:
      pass # or some other fallback action
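The same loop can avoid the try/except entirely with tag.get(), which returns None for a missing attribute; a minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a>no href</a><a href="/about">link</a>', "html.parser")

# .get() returns None instead of raising KeyError for a missing attribute.
hrefs = [a.get('href') for a in soup.find_all('a')]
print(hrefs)  # [None, '/about']
```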

AttributeError

You may encounter AttributeError as follows −

AttributeError: 'list' object has no attribute 'find_all'

The above error mainly occurs because you expected find_all() to return a single tag or string. However, soup.find_all() returns a Python list of elements.

All you need to do is iterate through the list and extract the data from those elements.
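A minimal sketch of iterating over the list that find_all() returns (the markup is made up):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

results = soup.find_all("p")   # a list, so results.find_all(...) would fail
texts = [p.get_text() for p in results]
print(texts)  # ['one', 'two']
```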

To avoid the above errors while parsing a result, you can skip that result, making sure that a malformed snippet isn't inserted into the database −

try:
   (...)
except (AttributeError, KeyError) as er:
   pass