Beautiful Soup - Convert Object to String



The Beautiful Soup API has three main types of objects: the soup object, the Tag object, and the NavigableString object. Let us find out how to convert each of these objects to a string. In Python, a string is a str object.

Assume that we have the following HTML document:

html = '''
<p>Hello <b>World</b></p>
'''

Let us pass this string as the argument to the BeautifulSoup constructor. The soup object is then converted to a string with Python's built-in str() function.

The parsed tree for this HTML string depends on which parser you use. The built-in html.parser doesn't add the <html> and <body> tags.

Example - Getting String from an Object

html = '''
<p>Hello <b>World</b></p>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print(str(soup))

Output

<p>Hello <b>World</b></p>

On the other hand, the html5lib parser (a separate package that must be installed) builds the tree after inserting the structural tags such as <html>, <head>, and <body>.

Example - Getting String from an Object using html5lib parser

html = '''
<p>Hello <b>World</b></p>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html5lib')
print(str(soup))

Output

<html><head></head><body><p>Hello <b>World</b></p>
</body></html>
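
The same conversion works on an individual Tag. The sketch below, which uses the built-in html.parser, shows str() on a tag alongside decode() and prettify(), two other methods that return the markup as a str. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = "<p>Hello <b>World</b></p>"
soup = BeautifulSoup(html, "html.parser")

tag = soup.find("b")

# str() on a Tag returns the tag's markup, including the tag itself
tag_markup = str(tag)
print(tag_markup)

# decode() returns the same markup that str() does
soup_markup = soup.decode()
print(soup_markup)

# prettify() also returns a str, indented with one element per line
pretty = soup.prettify()
print(pretty)
```

Use str() or decode() when the markup should stay compact, and prettify() when it is meant to be read by a person.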

The Tag object has a string property that returns a NavigableString object when the tag contains a single string child.

Example - Getting NavigableString

html = '''
<p>Hello <b>World</b></p>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

tag = soup.find('b')
obj = tag.string
print(type(obj), obj)

Output

<class 'bs4.element.NavigableString'> World
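
Because NavigableString is a subclass of str, it can be used wherever a string is expected, and str() turns it into a plain str that no longer carries a reference to the parse tree. The sketch below also shows that the string property is None when a tag has more than one child, as the <p> tag does here.

```python
from bs4 import BeautifulSoup

html = "<p>Hello <b>World</b></p>"
soup = BeautifulSoup(html, "html.parser")

nav = soup.find("b").string
print(isinstance(nav, str))   # NavigableString subclasses str

plain = str(nav)              # convert to a plain Python str
print(type(plain), plain)

# <p> has two children ("Hello " and <b>), so .string is None
p_string = soup.find("p").string
print(p_string)
```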

There is also a text property defined for the Tag object. It returns the text contained in the tag as a str, stripping off all the inner tags and attributes.

If the HTML string is:

html = '''
   <p>Hello <div id='id'>World</div></p>
'''

We try to obtain the text property of the <p> tag:

html = '''
   <p>Hello <div id='id'>World</div></p>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

tag = soup.find('p')
obj = tag.text
print(type(obj), obj)

Output

<class 'str'> Hello World

You can also use the get_text() method, which returns a string containing the text inside the tag. The text property is in fact a shortcut for get_text() called with its default arguments, so both get rid of inner tags and attributes and return a str.

html = '''
   <p>Hello <div id='id'>World</div></p>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

tag = soup.find('p')
obj = tag.get_text()
print(type(obj), obj)

Output

<class 'str'> Hello World
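
get_text() also accepts optional separator and strip arguments, which the text property cannot take. A minimal sketch, using a simpler document than the one above:

```python
from bs4 import BeautifulSoup

html = "<p>Hello <b>World</b></p>"
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("p")

# With default arguments, get_text() and .text give the same result
default_text = tag.get_text()
print(default_text == tag.text)

# separator is placed between text pieces; strip=True trims each piece
joined = tag.get_text(separator="|", strip=True)
print(joined)
```

The separator is handy when adjacent elements would otherwise run together in the extracted text.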