Beautiful Soup - Encoding
Every HTML or XML document is written in a specific encoding such as ASCII or UTF-8. When you load that document into BeautifulSoup, however, it is converted to Unicode.
Example
from bs4 import BeautifulSoup

markup = "<p>I will display £</p>"
soup = BeautifulSoup(markup, "html.parser")
print(soup.p)
print(soup.p.string)
Output
<p>I will display £</p>
I will display £
This happens because BeautifulSoup internally uses a sub-library called Unicode, Dammit to detect a document's encoding and convert it to Unicode.
Unicode, Dammit does not always guess correctly, however. Sometimes it guesses correctly only after a byte-by-byte search of the document, which takes a long time. If you already know the encoding, you can save time and avoid mistakes by passing it to the BeautifulSoup constructor as from_encoding.
Below is an example in which BeautifulSoup misidentifies an ISO-8859-8 document as ISO-8859-7:
Example
from bs4 import BeautifulSoup

markup = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(markup, "html.parser")
print(soup.h1)
print(soup.original_encoding)
Output
<h1>νεμω</h1>
ISO-8859-7
To fix this, pass the correct encoding to BeautifulSoup using from_encoding:
Example
from bs4 import BeautifulSoup

markup = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-8")
print(soup.h1)
print(soup.original_encoding)
Output
<h1>םולש</h1>
iso-8859-8
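Note that from_encoding only matters for byte input. As a minimal sketch (the variable names are illustrative, not part of the tutorial), the comparison below parses the same markup once as a Python str and once as UTF-8 bytes. For the str input no detection takes place, so original_encoding is expected to be None.

```python
from bs4 import BeautifulSoup

text_markup = "<p>I will display £</p>"
byte_markup = text_markup.encode("utf-8")

# A str is already Unicode, so there is nothing to detect or convert.
soup_from_text = BeautifulSoup(text_markup, "html.parser")

# Bytes must be decoded; from_encoding tells BeautifulSoup how.
soup_from_bytes = BeautifulSoup(byte_markup, "html.parser", from_encoding="utf-8")

print(soup_from_text.original_encoding)
print(soup_from_bytes.original_encoding)

# Both routes should yield the same Unicode text.
print(soup_from_text.p.string == soup_from_bytes.p.string)
```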
Another option, added in BeautifulSoup 4.4.0, is exclude_encodings. Use it when you don't know the correct encoding but are sure that Unicode, Dammit is guessing wrong.
soup = BeautifulSoup(markup, "html.parser", exclude_encodings=["ISO-8859-7"])
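A short sketch of exclude_encodings in use, reusing the byte string from the earlier example. Which encoding Unicode, Dammit settles on instead depends on the detection libraries installed, so this sketch makes no claim about the result beyond printing it.

```python
from bs4 import BeautifulSoup

markup = b"<h1>\xed\xe5\xec\xf9</h1>"

# Rule out the guess known to be wrong; Unicode, Dammit chooses among the rest.
soup = BeautifulSoup(markup, "html.parser", exclude_encodings=["ISO-8859-7"])

detected = soup.original_encoding
print(detected)  # varies by environment
print(soup.h1)
```

If the result is still wrong, passing the known encoding through from_encoding remains the more reliable option.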
Output encoding
BeautifulSoup writes its output as a UTF-8 document, regardless of the encoding of the input document. Below is a document whose Polish characters are in ISO-8859-2 format.
Example
from bs4 import BeautifulSoup

markup = """
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="content-type" CONTENT="text/html; charset=iso-8859-2">
</HEAD>
<BODY>
ą ć ę ł ń ó ś ź ż Ą Ć Ę Ł Ń Ó Ś Ź Ż
</BODY>
</HTML>
"""

# Encode the text so the input really is an ISO-8859-2 byte string,
# then tell BeautifulSoup which encoding to expect.
soup = BeautifulSoup(markup.encode("iso-8859-2"), "html.parser", from_encoding="iso-8859-2")
print(soup.prettify())
Output
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
 </head>
 <body>
  ą ć ę ł ń ó ś ź ż Ą Ć Ę Ł Ń Ó Ś Ź Ż
 </body>
</html>
Notice that the <meta> tag has been rewritten to reflect that the document generated by BeautifulSoup is now in UTF-8.
If you don't want UTF-8 output, pass the desired encoding to prettify().
print(soup.prettify("latin-1"))
Output
b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n<html>\n <head>\n  <meta content="text/html; charset=latin-1" http-equiv="content-type"/>\n </head>\n <body>\n  &#261; &#263; &#281; &#322; &#324; \xf3 &#347; &#378; &#380; &#260; &#262; &#280; &#321; &#323; \xd3 &#346; &#377; &#379;\n </body>\n</html>\n'
The example above encodes the complete document. You can also encode any individual element in the soup, just as if it were a Python string:
soup.p.encode("latin-1")
soup.h1.encode("latin-1")
Output
b'<p>My first paragraph.</p>'
b'<h1>My First Heading</h1>'
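The following is a self-contained sketch of both output options, using a small Polish sample document that is an assumption of this sketch and not part of the original tutorial. It shows prettify() returning text with the <meta> charset rewritten, prettify("latin-1") returning bytes, and encode() applied to a single element.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, encoded as ISO-8859-2 bytes.
html = (
    '<html><head>'
    '<meta http-equiv="content-type" content="text/html; charset=iso-8859-2">'
    '</head><body><p>zażółć gęślą jaźń</p></body></html>'
)
soup = BeautifulSoup(html.encode("iso-8859-2"), "html.parser",
                     from_encoding="iso-8859-2")

as_text = soup.prettify()             # str; <meta> now declares utf-8
as_latin1 = soup.prettify("latin-1")  # bytes; <meta> now declares latin-1
paragraph = soup.p.encode("latin-1")  # bytes for a single element

print(as_text)
print(as_latin1)
print(paragraph)
```

Characters that Latin-1 can represent, such as ó, appear as single bytes in the encoded output; the rest fall back to numeric references.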
Any characters that can't be represented in your chosen encoding are converted into numeric XML entity references, as in the example below:
from bs4 import BeautifulSoup

markup = "<b>\N{SNOWMAN}</b>"
snowman_soup = BeautifulSoup(markup, "html.parser")
tag = snowman_soup.b
print(tag.encode("utf-8"))
Output
b'<b>\xe2\x98\x83</b>'
If you encode the same tag in "latin-1" or "ascii", the output contains "&#9731;", a numeric character reference, because those encodings have no representation for the snowman character.
print(tag.encode("latin-1"))
print(tag.encode("ascii"))
Output
b'<b>&#9731;</b>'
b'<b>&#9731;</b>'
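The numeric reference is a lossless fallback: an HTML parser turns it back into the original character. A minimal sketch of that round trip, assuming html.parser as the parser:

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup("<b>\N{SNOWMAN}</b>", "html.parser").b

# ASCII cannot represent the snowman, so it becomes a numeric reference.
encoded = tag.encode("ascii")
print(encoded)

# Parsing the encoded markup again restores the original character.
restored = BeautifulSoup(encoded.decode("ascii"), "html.parser").b.string
print(restored == "\N{SNOWMAN}")
```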
Unicode, Dammit
You can also use Unicode, Dammit on its own, without BeautifulSoup. It is useful mainly when you have a document in an unknown encoding (often one in a foreign language) and simply want to convert it to a known format (Unicode).
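A minimal sketch of standalone use, based on the UnicodeDammit class that bs4 exports. The byte string and the list of candidate encodings are illustrative assumptions; the second argument lists encodings to try before any guessing takes place.

```python
from bs4 import UnicodeDammit

# Bytes in an encoding we pretend not to know for certain.
data = b"Sacr\xe9 bleu!"

# Suggest candidate encodings; Unicode, Dammit tries them first.
dammit = UnicodeDammit(data, ["latin-1", "iso-8859-1"])

print(dammit.unicode_markup)     # the text converted to Unicode
print(dammit.original_encoding)  # the encoding that was used
```

If no candidates are supplied, Unicode, Dammit falls back to its own detection, which is subject to the same misidentification described earlier.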