Beautiful Soup - Functions Reference

Beautiful Soup Useful Resources

Beautiful Soup - Kinds of objects



When we pass a html document or string to a beautifulsoup constructor, beautifulsoup basically converts a complex html page into different python objects. Below we are going to discuss four major kinds of objects defined in bs4 package.

  • Tag
  • NavigableString
  • BeautifulSoup
  • Comments

Tag Object

A HTML tag is used to define various types of content. A tag object in BeautifulSoup corresponds to an HTML or XML tag in the actual page or document.

Example - Getting type of Tag.

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
tag = soup.html
print (type(tag))

Output

<class 'bs4.element.Tag'>

Tags contain lot of attributes and methods and two important features of a tag are its name and attributes.

Name (tag.name)

Every tag contains a name and can be accessed through '.name' as suffix. tag.name will return the type of tag it is.

Example - Printing Name of the Tag

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
tag = soup.html
print (tag.name)

Output

html

However, if we change the tag name, same will be reflected in the HTML markup generated by the BeautifulSoup.

Example - Changing the tag name

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
tag = soup.html
tag.name = "strong"
print (tag)

Output

<strong><body><b class="boldest">TutorialsPoint</b></body></strong>

Attributes (tag.attrs)

A tag object can have any number of attributes. In the above example, the tag <b class="boldest"> has an attribute 'class' whose value is "boldest". Anything that is NOT tag, is basically an attribute and must contain a value. A dictionary of attributes and their values is returned by "attrs". You can access the attributes either through accessing the keys too.

In the example below, the string argument for Beautifulsoup() constructor contains HTML input tag. The attributes of input tag are returned by "attr".

Example - Printing attributes of a Tag

from bs4 import BeautifulSoup

soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'lxml')
tag = soup.input

print (tag.attrs)

Output

{'type': 'text', 'name': 'name', 'value': 'Raju'}

We can do all kind of modifications to our tag's attributes (add/remove/modify), using dictionary operators or methods.

In the following example, the value tag is updated. The updated HTML string shows changes.

Example - Updating tag value

from bs4 import BeautifulSoup

soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'lxml')
tag = soup.input

print (tag.attrs)
tag['value']='Ravi'
print (soup)

Output

<html><body><input name="name" type="text" value="Ravi"/></body></html>

We add a new id tag, and delete the value tag.

Example - Adding/Deleting Tag

from bs4 import BeautifulSoup

soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'lxml')
tag = soup.input

tag['id']='nm'
del tag['value']
print (soup)

Output

<html><body><input id="nm" name="name" type="text"/></body></html>

Multi-valued attributes

Some of the HTML5 attributes can have multiple values. Most commonly used is the class-attribute which can have multiple CSS-values. Others include 'rel', 'rev', 'headers', 'accesskey' and 'accept-charset'. The multi-valued attributes in beautiful soup are shown as list.

Example - Using multivalued attributes

from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body"></p>', 'lxml')
print ("css_soup.p['class']:", css_soup.p['class'])

css_soup = BeautifulSoup('<p class="body bold"></p>', 'lxml')
print ("css_soup.p['class']:", css_soup.p['class'])

Output

css_soup.p['class']: ['body']
css_soup.p['class']: ['body', 'bold']

However, if any attribute contains more than one value but it is not multi-valued attributes by any-version of HTML standard, beautiful soup will leave the attribute alone −

Example - Invalid attribute values

from bs4 import BeautifulSoup

id_soup = BeautifulSoup('<p id="body bold"></p>', 'lxml')
print ("id_soup.p['id']:", id_soup.p['id'])
print ("type(id_soup.p['id']):", type(id_soup.p['id']))

Output

id_soup.p['id']: body bold
type(id_soup.p['id']): <class 'str'>

NavigableString object

Usually, a certain string is placed in opening and closing tag of a certain type. The HTML engine of the browser applies the intended effect on the string while rendering the element. For example , in <b>Hello World</b>, you find a string in the middle of <b> and </b> tags so that it is rendered in bold.

The NavigableString object represents the contents of a tag. It is an object of bs4.element.NavigableString class. To access the contents, use ".string" with tag.

Example - Printing type of NavigableString

from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", 'html.parser')

print (soup.string)

print (type(soup.string))

Output

Hello, Tutorialspoint!
<class 'bs4.element.NavigableString'>

A NavigableString object is similar to a Python Unicode string. some of its features support Navigating the tree and Searching the tree. A NavigableString can be converted to a Unicode string with str() function.

Example - Converting NavigableString to String

from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>",'html.parser')

tag = soup.h2
string = str(tag.string)
print (string)

Output

Hello, Tutorialspoint!

Just as a Python string, which is immutable, the NavigableString also can't be modified in place. However, use replace_with() to replace the inner string of a tag with another.

Example - Updating a NavigableString

from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>",'html.parser')

tag = soup.h2
tag.string.replace_with("OnLine Tutorials Library")
print (tag.string)

Output

OnLine Tutorials Library

BeautifulSoup object

The BeautifulSoup object represents the entire parsed object. However, it can be considered to be similar to Tag object. It is the object created when we try to scrape a web resource. Because it is similar to a Tag object, it supports the functionality required to parse and search the document tree.

Example - Usage of BeautifulSoup

from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')

print (soup)
print (soup.name)
print ('type:',type(soup))

Output

<html>
<head>
<title>TutorialsPoint</title>
</head>
<body>
<h2>Departmentwise Employees</h2>
<ul>
<li>Accounts</li>
<ul>
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ul>
<li>Rani</li>
<li>Ankita</li>
</ul>
</ul>
</body>
</html>
[document]
type: <class 'bs4.BeautifulSoup'>

The name property of BeautifulSoup object always returns [document].

Two parsed documents can be combined if you pass a BeautifulSoup object as an argument to a certain function such as replace_with().

Example - Updating content using BeautifulSoup

from bs4 import BeautifulSoup
obj1 = BeautifulSoup("<book><title>Python</title></book>", features="xml")
obj2 = BeautifulSoup("<b>Beautiful Soup parser</b>", "lxml")

obj2.find('b').replace_with(obj1)
print (obj2)

Output

<html><body><book><title>Python</title></book></body></html>

Comment object

Any text written between <!-- and --> in HTML as well as XML document is treated as comment. BeautifulSoup can detect such commented text as a Comment object.

Example - Usage of Comment in Beautiful Soup

from bs4 import BeautifulSoup
markup = "<b><!--This is a comment text in HTML--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
print (comment, type(comment))

Output

This is a comment text in HTML <class 'bs4.element.Comment'>

The Comment object is a special type of NavigableString object. The prettify() method displays the comment text with special formatting −

Example - Formatting a Comment

print (soup.b.prettify())

Output

<b>
   <!--This is a comment text in HTML-->
</b>
Advertisements