- Beautiful Soup Tutorial
- Beautiful Soup - Home
- Beautiful Soup - Overview
- Beautiful Soup - Web Scraping
- Beautiful Soup - Installation
- Beautiful Soup - Souping the Page
- Beautiful Soup - Kinds of objects
- Beautiful Soup - Inspect Data Source
- Beautiful Soup - Scrape HTML Content
- Beautiful Soup - Navigating by Tags
- Beautiful Soup - Find Elements by ID
- Beautiful Soup - Find Elements by Class
- Beautiful Soup - Find Elements by Attribute
- Beautiful Soup - Searching the Tree
- Beautiful Soup - Modifying the Tree
- Beautiful Soup - Parsing a Section of a Document
- Beautiful Soup - Find all Children of an Element
- Beautiful Soup - Find Element using CSS Selectors
- Beautiful Soup - Find all Comments
- Beautiful Soup - Scraping List from HTML
- Beautiful Soup - Scraping Paragraphs from HTML
- BeautifulSoup - Scraping Link from HTML
- Beautiful Soup - Get all HTML Tags
- Beautiful Soup - Get Text Inside Tag
- Beautiful Soup - Find all Headings
- Beautiful Soup - Extract Title Tag
- Beautiful Soup - Extract Email IDs
- Beautiful Soup - Scrape Nested Tags
- Beautiful Soup - Parsing Tables
- Beautiful Soup - Selecting nth Child
- Beautiful Soup - Search by text inside a Tag
- Beautiful Soup - Remove HTML Tags
- Beautiful Soup - Remove all Styles
- Beautiful Soup - Remove all Scripts
- Beautiful Soup - Remove Empty Tags
- Beautiful Soup - Remove Child Elements
- Beautiful Soup - find vs find_all
- Beautiful Soup - Specifying the Parser
- Beautiful Soup - Comparing Objects
- Beautiful Soup - Copying Objects
- Beautiful Soup - Get Tag Position
- Beautiful Soup - Encoding
- Beautiful Soup - Output Formatting
- Beautiful Soup - Pretty Printing
- Beautiful Soup - NavigableString Class
- Beautiful Soup - Convert Object to String
- Beautiful Soup - Convert HTML to Text
- Beautiful Soup - Parsing XML
- Beautiful Soup - Error Handling
- Beautiful Soup - Trouble Shooting
- Beautiful Soup - Porting Old Code
- Beautiful Soup - Functions Reference
- Beautiful Soup - contents Property
- Beautiful Soup - children Property
- Beautiful Soup - string Property
- Beautiful Soup - strings Property
- Beautiful Soup - stripped_strings Property
- Beautiful Soup - descendants Property
- Beautiful Soup - parent Property
- Beautiful Soup - parents Property
- Beautiful Soup - next_sibling Property
- Beautiful Soup - previous_sibling Property
- Beautiful Soup - next_siblings Property
- Beautiful Soup - previous_siblings Property
- Beautiful Soup - next_element Property
- Beautiful Soup - previous_element Property
- Beautiful Soup - next_elements Property
- Beautiful Soup - previous_elements Property
- Beautiful Soup - find Method
- Beautiful Soup - find_all Method
- Beautiful Soup - find_parents Method
- Beautiful Soup - find_parent Method
- Beautiful Soup - find_next_siblings Method
- Beautiful Soup - find_next_sibling Method
- Beautiful Soup - find_previous_siblings Method
- Beautiful Soup - find_previous_sibling Method
- Beautiful Soup - find_all_next Method
- Beautiful Soup - find_next Method
- Beautiful Soup - find_all_previous Method
- Beautiful Soup - find_previous Method
- Beautiful Soup - select Method
- Beautiful Soup - append Method
- Beautiful Soup - extend Method
- Beautiful Soup - NavigableString Method
- Beautiful Soup - new_tag Method
- Beautiful Soup - insert Method
- Beautiful Soup - insert_before Method
- Beautiful Soup - insert_after Method
- Beautiful Soup - clear Method
- Beautiful Soup - extract Method
- Beautiful Soup - decompose Method
- Beautiful Soup - replace_with Method
- Beautiful Soup - wrap Method
- Beautiful Soup - unwrap Method
- Beautiful Soup - smooth Method
- Beautiful Soup - prettify Method
- Beautiful Soup - encode Method
- Beautiful Soup - decode Method
- Beautiful Soup - get_text Method
- Beautiful Soup - diagnose Method
- Beautiful Soup Useful Resources
- Beautiful Soup - Quick Guide
- Beautiful Soup - Useful Resources
- Beautiful Soup - Discussion
Beautiful Soup - Output Formatting
If the HTML string given to BeautifulSoup constructor contains any of the HTML entities, they will be converted to Unicode characters.
An HTML entity is a string that begins with an ampersand ( & ) and ends with a semicolon ( ; ). They are used to display reserved characters (which would otherwise be interpreted as HTML code). Some of the examples of HTML entities are −
< | less than | < | < |
> | greater than | > | > |
& | ampersand | & | & |
" | double quote | " | " |
' | single quote | ' | ' |
" | Left Double quote | “ | “ |
" | Right double quote | ” | ” |
£ | Pound | £ | £ |
¥ | yen | ¥ | ¥ |
€ | euro | € | € |
© | copyright | © | © |
By default, the only characters that are escaped upon output are bare ampersands and angle brackets. These get turned into "&", "<", and ">"
For others, they'll be converted to Unicode characters.
Example
from bs4 import BeautifulSoup soup = BeautifulSoup("Hello “World!”", 'html.parser') print (str(soup))
Output
Hello "World!"
If you then convert the document to a bytestring, the Unicode characters will be encoded as UTF-8. You won't get the HTML entities back −
Example
from bs4 import BeautifulSoup soup = BeautifulSoup("Hello “World!”", 'html.parser') print (soup.encode())
Output
b'Hello \xe2\x80\x9cWorld!\xe2\x80\x9d'
To change this behavior provide a value for the formatter argument to prettify() method. There are following possible values for the formatter.
formatter="minimal" − This is the default. Strings will only be processed enough to ensure that Beautiful Soup generates valid HTML/XML
formatter="html" − Beautiful Soup will convert Unicode characters to HTML entities whenever possible.
formatter="html5" − it's similar to formatter="html", but Beautiful Soup will omit the closing slash in HTML void tags like "br"
formatter=None − Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML
Example
from bs4 import BeautifulSoup french = "<p>Il a dit <<Sacré bleu!>></p>" soup = BeautifulSoup(french, 'html.parser') print ("minimal: ") print(soup.prettify(formatter="minimal")) print ("html: ") print(soup.prettify(formatter="html")) print ("None: ") print(soup.prettify(formatter=None))
Output
minimal: <p> Il a dit <<Sacré bleu!>> </p> html: <p> Il a dit <<Sacré bleu!>> </p> None: <p> Il a dit <<Sacré bleu!>> </p>
In addition, Beautiful Soup library provides formatter classes. You can pass an object of any of these classes as argument to prettify() method.
HTMLFormatter class − Used to customize the formatting rules for HTML documents.
XMLFormatter class − Used to customize the formatting rules for XML documents.