Beautiful Soup - Encoding
All HTML or XML documents are written in some specific encoding such as ASCII or UTF-8. However, when you load an HTML/XML document into BeautifulSoup, it is converted to Unicode.
Example - Checking Unicode
from bs4 import BeautifulSoup
markup = "<p>I will display £</p>"
soup = BeautifulSoup(markup, "html.parser")
print(soup.p)
print(soup.p.string)
Output
<p>I will display £</p>
I will display £
This works because BeautifulSoup internally uses a sub-library called Unicode, Dammit to detect a document's encoding and then convert it to Unicode.
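As a rough illustration of that conversion, the sketch below feeds UTF-8 bytes (rather than a Python string) to BeautifulSoup and inspects the result. The byte string is an assumption made for this example, and the value reported by original_encoding can vary with the detection libraries installed, so no particular value is shown.

```python
from bs4 import BeautifulSoup

# Hypothetical input: the pound-sign paragraph encoded as UTF-8 bytes.
raw = "<p>I will display £</p>".encode("utf-8")

soup = BeautifulSoup(raw, "html.parser")

# Whatever encoding was guessed, the parsed text is a Unicode string.
text = soup.p.string
print(isinstance(text, str))

# The encoding that Unicode, Dammit settled on for the input bytes.
print(soup.original_encoding)
```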
However, Unicode, Dammit does not always guess correctly. Because it examines the document byte by byte to guess the encoding, detection can also take a long time. If you already know the encoding, you can save time and avoid mistakes by passing it to the BeautifulSoup constructor as from_encoding.
Below is an example where BeautifulSoup misidentifies an ISO-8859-8 document as ISO-8859-7 −
Example - Misidentification of encoding
from bs4 import BeautifulSoup
markup = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(markup, 'html.parser')
print(soup.h1)
print(soup.original_encoding)
Output
<h1>νεμω</h1>
iso-8859-7
To resolve this issue, pass the correct encoding to the BeautifulSoup constructor using from_encoding −
Example - Using from_encoding
from bs4 import BeautifulSoup
markup = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-8")
print(soup.h1)
print(soup.original_encoding)
Output
<h1>םולש</h1>
iso-8859-8
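The two results differ only in which codec interprets the bytes. A minimal sketch using Python's built-in codecs (no BeautifulSoup involved) shows that both decodings succeed without error, which is why validity alone cannot tell the two encodings apart:

```python
# The same four bytes used in the heading of the examples above.
raw = b"\xed\xe5\xec\xf9"

# Both codecs accept every byte, so neither decode raises an error.
as_greek = raw.decode("iso-8859-7")
as_hebrew = raw.decode("iso-8859-8")

print(as_greek)   # Greek letters
print(as_hebrew)  # Hebrew letters
```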
Another feature, added in BeautifulSoup 4.4.0, is exclude_encodings. You can use it when you don't know the correct encoding but are sure that Unicode, Dammit is guessing a wrong one.
soup = BeautifulSoup(markup, "html.parser", exclude_encodings=["iso-8859-7"])
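For completeness, here is a self-contained sketch of that call applied to the byte string from the earlier examples. Excluding an encoding only rules out one guess; which encoding is chosen instead depends on the detection libraries installed and is not guaranteed to be the right one, so the printed value is not shown.

```python
from bs4 import BeautifulSoup

markup = b"<h1>\xed\xe5\xec\xf9</h1>"

# Rule out the guess known to be wrong; detection continues with other candidates.
soup = BeautifulSoup(markup, "html.parser", exclude_encodings=["iso-8859-7"])

# Prints whichever encoding was used instead of the excluded one.
print(soup.original_encoding)
```

If the result is still wrong, passing the known encoding through from_encoding remains the more reliable option.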
Output encoding
When you write a document out from BeautifulSoup, the result is a UTF-8 document, regardless of the encoding of the document you passed in. Below is a document in which the Polish characters are encoded as ISO-8859-2.
Example - Using ISO-8859-2 format
# Encode the markup so that the input really is an ISO-8859-2 byte string.
markup = """
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="content-type" CONTENT="text/html; charset=iso-8859-2">
</HEAD>
<BODY>
ą ć ę ł ń ó ś ź ż Ą Ć Ę Ł Ń Ó Ś Ź Ż
</BODY>
</HTML>
""".encode("iso-8859-2")
from bs4 import BeautifulSoup
soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-2")
print(soup.prettify())
Output
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
 </head>
 <body>
  ą ć ę ł ń ó ś ź ż Ą Ć Ę Ł Ń Ó Ś Ź Ż
 </body>
</html>
Notice that in the above example the <meta> tag has been rewritten to reflect that the document generated by BeautifulSoup is now in UTF-8.
If you don't want the generated output in UTF-8, you can pass the desired encoding to prettify(). The result is then a bytes object rather than a string.
# Encode the markup so that the input really is an ISO-8859-2 byte string.
markup = """
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="content-type" CONTENT="text/html; charset=iso-8859-2">
</HEAD>
<BODY>
ą ć ę ł ń ó ś ź ż Ą Ć Ę Ł Ń Ó Ś Ź Ż
</BODY>
</HTML>
""".encode("iso-8859-2")
from bs4 import BeautifulSoup
soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-2")
print(soup.prettify("latin-1"))
Output
b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n<html>\n <head>\n  <meta content="text/html; charset=latin-1" http-equiv="content-type"/>\n </head>\n <body>\n  &#261; &#263; &#281; &#322; &#324; \xf3 &#347; &#378; &#380; &#260; &#262; &#280; &#321; &#323; \xd3 &#346; &#377; &#379;\n </body>\n</html>\n'
In the above example we encoded the complete document. However, you can also encode any particular element in the soup, just as if it were a Python string −
markup = """
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="content-type" CONTENT="text/html; charset=iso-8859-2">
</HEAD>
<BODY>
<p>My first paragraph.</p>
<h1>My First Heading</h1>
</BODY>
</HTML>
"""
from bs4 import BeautifulSoup
# markup is already a Python string, so from_encoding would be ignored here.
soup = BeautifulSoup(markup, "html.parser")
print(soup.p.encode("latin-1"))
print(soup.h1.encode("latin-1"))
Output
b'<p>My first paragraph.</p>'
b'<h1>My First Heading</h1>'
Any characters that can't be represented in your chosen encoding are converted into numeric XML entity references. The snowman character below can be represented in UTF-8, so it is encoded directly −
from bs4 import BeautifulSoup
markup = "<b>\N{SNOWMAN}</b>"
snowman_soup = BeautifulSoup(markup, "html.parser")
tag = snowman_soup.b
print(tag.encode("utf-8"))
Output
b'<b>\xe2\x98\x83</b>'
If you try to encode the same tag in "latin-1" or "ascii", the snowman is replaced with "&#9731;", because neither encoding has a representation for that character.
from bs4 import BeautifulSoup
markup = "<b>\N{SNOWMAN}</b>"
snowman_soup = BeautifulSoup(markup, "html.parser")
tag = snowman_soup.b
print(tag.encode("latin-1"))
print(tag.encode("ascii"))
Output
b'<b>&#9731;</b>'
b'<b>&#9731;</b>'
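The same kind of substitution can be reproduced with plain Python strings by using the built-in "xmlcharrefreplace" error handler. This is a sketch of the idea with the standard library only, not a description of how BeautifulSoup is implemented internally:

```python
snowman = "<b>\N{SNOWMAN}</b>"

# "xmlcharrefreplace" swaps characters the target encoding cannot
# represent for numeric references of the form &#NNNN;.
as_ascii = snowman.encode("ascii", errors="xmlcharrefreplace")
as_latin1 = snowman.encode("latin-1", errors="xmlcharrefreplace")

# UTF-8 can represent the snowman, so nothing is replaced here.
as_utf8 = snowman.encode("utf-8", errors="xmlcharrefreplace")

print(as_ascii)   # b'<b>&#9731;</b>'
print(as_latin1)  # b'<b>&#9731;</b>'
print(as_utf8)    # b'<b>\xe2\x98\x83</b>'
```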
Unicode, Dammit
Unicode, Dammit can also be used on its own, without BeautifulSoup. It is useful whenever you have a document in an unknown encoding and simply want to convert it to a known one (Unicode).
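A minimal sketch of standalone use follows. The sample bytes and the list of candidate encodings are assumptions chosen for illustration; the UnicodeDammit class is imported from bs4 and its second argument lists encodings to try before any guessing takes place.

```python
from bs4 import UnicodeDammit

# Hypothetical input: Latin-1 bytes whose encoding we only suspect.
raw = b"Sacr\xe9 bleu!"

# Try the suspected encodings first, before any automatic detection.
dammit = UnicodeDammit(raw, ["latin-1", "iso-8859-1"])

print(dammit.unicode_markup)     # Sacré bleu!
print(dammit.original_encoding)  # the encoding that was actually used
```

If none of the listed encodings can decode the bytes, Unicode, Dammit falls back to its usual detection, so the result should still be checked.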