Beautiful Soup - Pretty Printing



To display the entire parsed tree of an HTML document, or the contents of a specific tag, pass the object to the print() function or convert it to a string with str().

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("<h1>Hello World</h1>", "lxml")
print("Tree:", soup)
print("h1 tag:", str(soup.h1))

Output

Tree: <html><body><h1>Hello World</h1></body></html>
h1 tag: <h1>Hello World</h1>

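In Python 3, the str() function returns a Unicode string with no extra formatting. To get UTF-8 encoded bytes instead, call the encode() method on the object.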
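As a minimal sketch of how text and bytes output differ in Python 3, the snippet below converts a tag with str() and with encode(). The sample markup is only an illustration, and the built-in "html.parser" is used here as an assumption so that the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Illustrative markup containing a non-ASCII character (e with acute accent)
soup = BeautifulSoup("<h1>Caf\u00e9</h1>", "html.parser")

text = str(soup.h1)            # a Python 3 str (Unicode text)
raw = soup.h1.encode("utf-8")  # a bytes object encoded as UTF-8

print(type(text), text)
print(type(raw), raw)
```

Decoding the bytes with raw.decode("utf-8") gives back the same text that str() returned.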

To get a nicely formatted Unicode string, use Beautiful Soup's prettify() method. It formats the parse tree so that each tag and each string is on its own line, indented according to its depth. This makes it easy to visualize the structure of the parse tree.

Consider the following HTML string.

<p>The quick, <b>brown fox</b> jumps over a lazy dog.</p>

Using the prettify() method we can better understand its structure −

from bs4 import BeautifulSoup

html = '''
   <p>The quick, <b>brown fox</b> jumps over a lazy dog.</p>
'''

soup = BeautifulSoup(html, "lxml")
print(soup.prettify())

Output

<html>
 <body>
  <p>
   The quick,
   <b>
    brown fox
   </b>
   jumps over a lazy dog.
  </p>
 </body>
</html>

You can call prettify() on any of the Tag objects in the document.

print(soup.b.prettify())

Output

<b>
 brown fox
</b>

The prettify() method is meant for understanding the structure of a document. It should not be used to reformat one, because it adds whitespace (newlines and indentation), and whitespace can change the meaning of an HTML document.
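The effect of the added whitespace can be seen with a rough sketch like the one below. It parses the prettified output again and compares the text of the two trees. The variable names are illustrative, and the built-in "html.parser" is used here as an assumption so that the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup

html = "<p>The quick, <b>brown fox</b> jumps over a lazy dog.</p>"
soup = BeautifulSoup(html, "html.parser")

# Text of the document as originally parsed
original_text = soup.get_text()

# Parse the prettified markup again and extract its text
pretty_soup = BeautifulSoup(soup.prettify(), "html.parser")
pretty_text = pretty_soup.get_text()

# repr() makes the extra newlines and spaces visible
print(repr(original_text))
print(repr(pretty_text))
print("Same text:", original_text == pretty_text)
```

The second string contains newlines and indentation that were not in the source, so a program that saves prettify() output as the new document is storing different text content.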

The prettify() method optionally accepts a formatter argument that controls the output formatting, such as how special characters are escaped.
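As a small sketch of the formatter argument, the snippet below prints the same tag with two of the built-in formatter names, "minimal" (the default) and "html". The sample markup is only an illustration, and the built-in "html.parser" is used here as an assumption so that the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Illustrative markup with an accented character and an ampersand
soup = BeautifulSoup("<p>caf&eacute; &amp; bar</p>", "html.parser")

# "minimal" escapes only what is needed to keep the markup valid
minimal_out = soup.p.prettify(formatter="minimal")

# "html" also converts characters to named HTML entities where possible
html_out = soup.p.prettify(formatter="html")

print(minimal_out)
print(html_out)
```

With "minimal" the accented character is written out as is, while "html" writes it as a named entity. In both cases the ampersand stays escaped.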
