In this chapter, we shall discuss different methods in Beautiful Soup for navigating the HTML document tree in different directions - going up and down, sideways, and back and forth.

You can navigate the parse tree by using the name of the tag you want as an attribute. For example, soup.head fetches the <head> element −

Example - Extract a Head Tree

from bs4 import BeautifulSoup

html = """
<html><head><title>TutorialsPoint</title></head>
   <body>
      <p class="title"><b>Online Tutorials Library</b></p>

      <p class="story">TutorialsPoint has an excellent collection of tutorials on:
      <a href="https://tutorialspoint.com/Python" class="lang" id="link1">Python</a>,
      <a href="https://tutorialspoint.com/Java" class="lang" id="link2">Java</a> and
      <a href="https://tutorialspoint.com/PHP" class="lang" id="link3">PHP</a>;
      Enhance your Programming skills.</p>

      <p class="tutorial">...</p>
"""

soup = BeautifulSoup(html, 'html.parser')

print (soup.head.prettify())

Output

<head>
 <title>
  TutorialsPoint
 </title>
</head>
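Tag names can also be chained to step further into the tree. The following is a minimal sketch using a shortened, hypothetical HTML string (not the chapter's full sample); the variable names are illustrative only. Attribute-style access returns just the first matching tag, and a tag name that does not occur in the document gives None.

```python
from bs4 import BeautifulSoup

# Shortened illustrative markup (an assumption, not the chapter's full sample)
html = "<html><head><title>TutorialsPoint</title></head><body><p><b>Online</b></p><p>Second</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

first_bold = soup.body.p.b   # chain tag names to walk downwards
first_para = soup.p          # only the first <p> is returned
missing = soup.table         # there is no <table> in this markup

print(first_bold)
print(first_para)
print(missing)
```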

Going down

A tag may contain strings and other tags. The .contents attribute of a Tag object returns a list of its direct children.

Example - Getting the contents of a Tag

from bs4 import BeautifulSoup

html = """
<html><head><title>TutorialsPoint</title></head>
   <body>
      <p class="title"><b>Online Tutorials Library</b></p>

      <p class="story">TutorialsPoint has an excellent collection of tutorials on:
      <a href="https://tutorialspoint.com/Python" class="lang" id="link1">Python</a>,
      <a href="https://tutorialspoint.com/Java" class="lang" id="link2">Java</a> and
      <a href="https://tutorialspoint.com/PHP" class="lang" id="link3">PHP</a>;
      Enhance your Programming skills.</p>

      <p class="tutorial">...</p>
"""

soup = BeautifulSoup(html, 'html.parser')

tag = soup.head
print (tag.contents)

Output

[<title>TutorialsPoint</title>]

The result is a list even though, in this case, the head element encloses only a single child tag.
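Because the children are held in an ordinary Python list, the usual list operations such as len() and indexing apply. A small sketch, assuming a hypothetical <ul> snippet rather than the chapter's sample document:

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration
soup = BeautifulSoup("<ul><li>Python</li><li>Java</li><li>PHP</li></ul>", "html.parser")

items = soup.ul.contents          # direct children of <ul>, as a list
print(len(items))                 # number of direct children
print(items[0].string)            # first child
print(items[-1].string)           # last child
```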

Using .children property

The .children property exposes the same direct children as an iterator instead of a list. Below, list() is used to collect all the children of the body tag into a list.

Example - List all enclosed elements

from bs4 import BeautifulSoup

html = """
<html><head><title>TutorialsPoint</title></head>
   <body>
      <p class="title"><b>Online Tutorials Library</b></p>

      <p class="story">TutorialsPoint has an excellent collection of tutorials on:
      <a href="https://tutorialspoint.com/Python" class="lang" id="link1">Python</a>,
      <a href="https://tutorialspoint.com/Java" class="lang" id="link2">Java</a> and
      <a href="https://tutorialspoint.com/PHP" class="lang" id="link3">PHP</a>;
      Enhance your Programming skills.</p>

      <p class="tutorial">...</p>
"""

soup = BeautifulSoup(html, 'html.parser')

tag = soup.body 
print (list(tag.children))

Output

['\n', <p class="title"><b>Online Tutorials Library</b></p>, '\n', 
<p class="story">TutorialsPoint has an excellent collection of tutorials on:
<a class="lang" href="https://tutorialspoint.com/Python" id="link1">Python</a>,
<a class="lang" href="https://tutorialspoint.com/Java" id="link2">Java</a> and
<a class="lang" href="https://tutorialspoint.com/PHP" id="link3">PHP</a>;
Enhance your Programming skills.</p>, '\n', <p class="tutorial">...</p>, '\n']

Instead of getting them as a list, you can iterate over a tag's children using the .children generator −

Example - Iterating over children

from bs4 import BeautifulSoup

html = """
<html><head><title>TutorialsPoint</title></head>
   <body>
      <p class="title"><b>Online Tutorials Library</b></p>

      <p class="story">TutorialsPoint has an excellent collection of tutorials on:
      <a href="https://tutorialspoint.com/Python" class="lang" id="link1">Python</a>,
      <a href="https://tutorialspoint.com/Java" class="lang" id="link2">Java</a> and
      <a href="https://tutorialspoint.com/PHP" class="lang" id="link3">PHP</a>;
      Enhance your Programming skills.</p>

      <p class="tutorial">...</p>
"""

soup = BeautifulSoup(html, 'html.parser')

tag = soup.body 
for child in tag.children:
   print (child)

Output

<p class="title"><b>Online Tutorials Library</b></p>
<p class="story">TutorialsPoint has an excellent collection of tutorials on:
<a class="lang" href="https://tutorialspoint.com/Python" id="link1">Python</a>,
<a class="lang" href="https://tutorialspoint.com/Java" id="link2">Java</a> and
<a class="lang" href="https://tutorialspoint.com/PHP" id="link3">PHP</a>;
Enhance your Programming skills.</p>

<p class="tutorial">...</p>
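The blank lines in output like this come from whitespace-only strings that sit between the tags and count as children too. If only the tags are of interest, one possible approach is to filter with isinstance(). The snippet below is a sketch on a small hypothetical <body>, not the chapter's sample document.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup with newlines between the paragraphs
html = "<body>\n<p>one</p>\n<p>two</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")

all_children = list(soup.body.children)
# Keep only Tag objects, dropping the newline strings
tag_children = [child for child in soup.body.children if isinstance(child, Tag)]

print(len(all_children))
print([tag.get_text() for tag in tag_children])
```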

Using .descendants attribute

The .contents and .children attributes only consider a tag's direct children. The .descendants attribute lets you iterate over all of a tag's children, recursively: its direct children, the children of its direct children, and so on.

The BeautifulSoup object sits at the top of the hierarchy of tags, so its .descendants attribute covers every element in the HTML string.

Example - Usage of the descendants attribute

from bs4 import BeautifulSoup

html = """
<html><head><title>TutorialsPoint</title></head>
   <body>
      <p class="title"><b>Online Tutorials Library</b></p>

      <p class="story">TutorialsPoint has an excellent collection of tutorials on:
      <a href="https://tutorialspoint.com/Python" class="lang" id="link1">Python</a>,
      <a href="https://tutorialspoint.com/Java" class="lang" id="link2">Java</a> and
      <a href="https://tutorialspoint.com/PHP" class="lang" id="link3">PHP</a>;
      Enhance your Programming skills.</p>

      <p class="tutorial">...</p>
"""

soup = BeautifulSoup(html, 'html.parser')

print (soup.descendants)

Output

<generator object Tag.descendants at 0x7fb9333a9970>

The .descendants attribute returns a generator, which can be iterated with a for loop. Here, we list the descendants of the head tag.

Example - Listing descendants of head tag

from bs4 import BeautifulSoup

html = """
<html><head><title>TutorialsPoint</title></head>
   <body>
      <p class="title"><b>Online Tutorials Library</b></p>

      <p class="story">TutorialsPoint has an excellent collection of tutorials on:
      <a href="https://tutorialspoint.com/Python" class="lang" id="link1">Python</a>,
      <a href="https://tutorialspoint.com/Java" class="lang" id="link2">Java</a> and
      <a href="https://tutorialspoint.com/PHP" class="lang" id="link3">PHP</a>;
      Enhance your Programming skills.</p>

      <p class="tutorial">...</p>
"""

soup = BeautifulSoup(html, 'html.parser')
tag = soup.head
for element in tag.descendants:
   print (element)

Output

<title>TutorialsPoint</title>
TutorialsPoint

The head tag contains a title tag, which in turn encloses the NavigableString object TutorialsPoint. The <head> tag therefore has only one child but two descendants: the <title> tag and the <title> tag's child. Likewise, the BeautifulSoup object has only one direct child tag (the <html> tag) but many descendants.

Example - Getting Elements count

from bs4 import BeautifulSoup

html = """
<html><head><title>TutorialsPoint</title></head>
   <body>
      <p class="title"><b>Online Tutorials Library</b></p>

      <p class="story">TutorialsPoint has an excellent collection of tutorials on:
      <a href="https://tutorialspoint.com/Python" class="lang" id="link1">Python</a>,
      <a href="https://tutorialspoint.com/Java" class="lang" id="link2">Java</a> and
      <a href="https://tutorialspoint.com/PHP" class="lang" id="link3">PHP</a>;
      Enhance your Programming skills.</p>

      <p class="tutorial">...</p>
"""

soup = BeautifulSoup(html, 'html.parser')

tags = list(soup.descendants)
print (len(tags))

Output

27
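The count includes strings as well as tags. As a rough sketch, the descendants can be split by type to see how many of each there are; the markup here is a shortened, hypothetical document, so the numbers differ from the output above.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Shortened illustrative markup (an assumption, not the chapter's sample)
html = "<html><head><title>TutorialsPoint</title></head><body><p>Hi <b>there</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

descendants = list(soup.descendants)
tags = [d for d in descendants if isinstance(d, Tag)]
strings = [d for d in descendants if isinstance(d, NavigableString)]

print(len(descendants), len(tags), len(strings))
```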

Going Up

Just as you navigate down a document with the .children and .descendants properties, BeautifulSoup offers the .parent and .parents properties to navigate up from a tag.

Using .parent attribute

Every tag and every string has a parent tag that contains it. You can access an element's parent with the .parent attribute. In our example, the <head> tag is the parent of the <title> tag.

Example - Using .parent attribute

from bs4 import BeautifulSoup

html = """
<html><head><title>TutorialsPoint</title></head>
   <body>
      <p class="title"><b>Online Tutorials Library</b></p>

      <p class="story">TutorialsPoint has an excellent collection of tutorials on:
      <a href="https://tutorialspoint.com/Python" class="lang" id="link1">Python</a>,
      <a href="https://tutorialspoint.com/Java" class="lang" id="link2">Java</a> and
      <a href="https://tutorialspoint.com/PHP" class="lang" id="link3">PHP</a>;
      Enhance your Programming skills.</p>

      <p class="tutorial">...</p>
"""

soup = BeautifulSoup(html, 'html.parser')

tag = soup.title
print (tag.parent)

Output

<head><title>TutorialsPoint</title></head>

The string inside the title tag is a NavigableString, and its parent is the title tag itself.

Example - Getting the parent of a string

from bs4 import BeautifulSoup

html = """
<html><head><title>TutorialsPoint</title></head>
   <body>
      <p class="title"><b>Online Tutorials Library</b></p>

      <p class="story">TutorialsPoint has an excellent collection of tutorials on:
      <a href="https://tutorialspoint.com/Python" class="lang" id="link1">Python</a>,
      <a href="https://tutorialspoint.com/Java" class="lang" id="link2">Java</a> and
      <a href="https://tutorialspoint.com/PHP" class="lang" id="link3">PHP</a>;
      Enhance your Programming skills.</p>

      <p class="tutorial">...</p>
"""

soup = BeautifulSoup(html, 'html.parser')

tag = soup.title
string = tag.string
print (string.parent)

Output

<title>TutorialsPoint</title>

Using .parents property

You can iterate over all of an element's ancestors with .parents. The following code starts at the first <a> tag in the example HTML string, buried deep within the document, and travels to the very top of the document.

Example - Usage of .parents property

from bs4 import BeautifulSoup

html = """
<html><head><title>TutorialsPoint</title></head>
   <body>
      <p class="title"><b>Online Tutorials Library</b></p>

      <p class="story">TutorialsPoint has an excellent collection of tutorials on:
      <a href="https://tutorialspoint.com/Python" class="lang" id="link1">Python</a>,
      <a href="https://tutorialspoint.com/Java" class="lang" id="link2">Java</a> and
      <a href="https://tutorialspoint.com/PHP" class="lang" id="link3">PHP</a>;
      Enhance your Programming skills.</p>

      <p class="tutorial">...</p>
"""

soup = BeautifulSoup(html, 'html.parser')

tag = soup.a 
print (tag.string)

for parent in tag.parents:
   print (parent.name)

Output

Python
p
body
html
[document]
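One way to put .parents to use is to build a readable path from the top of the document down to an element. The sketch below assumes a small hypothetical document; the path format is only an illustration. It also shows that the BeautifulSoup object at the top has no parent of its own.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration
html = "<html><body><p>See <a href='#'>Python</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .parents yields the nearest ancestor first, so reverse it for a top-down path
names = [parent.name for parent in soup.a.parents]
path = " > ".join(reversed(names))

print(path)
print(soup.parent)   # the top-level object has no parent
```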

Sideways

Elements that share the same parent are called siblings. Consider the following HTML snippet −

<p>
   <b>
      Hello
   </b>
   <i>
      Python
   </i>
</p>

Inside the outer <p> tag, the <b> and <i> tags are both direct children of the same parent, hence they are siblings. BeautifulSoup makes it possible to navigate between elements at the same level.

.next_sibling and .previous_sibling

These attributes return, respectively, the next and the previous element (a tag or a string) that shares the same parent.

Example - Getting Siblings

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", 'html.parser')

tag1 = soup.b 
print ("next:",tag1.next_sibling)

tag2 = soup.i 
print ("previous:",tag2.previous_sibling)

Output

next: <i>Python</i>
previous: <b>Hello</b>

The <b> tag has no sibling to its left and the <i> tag has no sibling to its right, so .previous_sibling and .next_sibling return None in these cases.

Example - Checking for missing siblings

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", 'html.parser')

tag1 = soup.b
print ("previous:",tag1.previous_sibling)

tag2 = soup.i
print ("next:",tag2.next_sibling)

Output

previous: None
next: None
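In real documents the tags are usually separated by whitespace, and that whitespace is itself a sibling string. The sketch below, on a hypothetical snippet with a newline between the two tags, contrasts .next_sibling with the find_next_sibling() method, which searches the following siblings for a matching tag.

```python
from bs4 import BeautifulSoup

# Same two tags as above, but with a newline between them (illustrative)
soup = BeautifulSoup("<p><b>Hello</b>\n<i>Python</i></p>", "html.parser")

raw_next = soup.b.next_sibling            # the newline string, not the <i> tag
tag_next = soup.b.find_next_sibling("i")  # skips the string and finds <i>

print(repr(raw_next))
print(tag_next)
```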

.next_siblings and .previous_siblings

If there are two or more siblings to the right or left of a tag, they can be navigated with the help of the .next_siblings and .previous_siblings attributes respectively. Both return a generator object, so a for loop can be used to iterate over them.

Let us use the following HTML snippet for this purpose −

<p>
   <b>
      Excellent
   </b>
   <i>
      Python
   </i>
   <u>
      Tutorial
   </u>
</p>

Use the following code to traverse next and previous sibling tags.

Example - Traversing siblings

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>", 'html.parser')

tag1 = soup.b 
print ("next siblings:")
for tag in tag1.next_siblings:
   print (tag)
print ("previous siblings:")
tag2 = soup.u 
for tag in tag2.previous_siblings:
   print (tag)

Output

next siblings:
<i>Python</i>
<u>Tutorial</u>
previous siblings:
<i>Python</i>
<b>Excellent</b>
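As the output shows, .previous_siblings yields the nearest sibling first, which is the reverse of document order. If document order is wanted, one simple option is to collect the results and reverse them, as in this sketch (the variable names are illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>", "html.parser")

# Nearest sibling comes first when walking backwards from <u>
nearest_first = [tag.name for tag in soup.u.previous_siblings]
# Reverse to get the siblings in the order they appear in the document
document_order = list(reversed(nearest_first))

print(nearest_first)
print(document_order)
```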

Back and forth

In Beautiful Soup, the .next_element attribute returns the string or tag that was parsed immediately after the current one, and the .previous_element attribute returns the one parsed immediately before it. These can differ from .next_sibling and .previous_sibling: the next element of a tag that has children is its first child, not the sibling that follows it.

.next_element and .previous_element

Example - Usage of .next_element and .previous_element

from bs4 import BeautifulSoup

html = """
<html><head><title>TutorialsPoint</title></head>
<body>
<p class="title"><b>Online Tutorials Library</b></p>

<p class="story">TutorialsPoint has an excellent collection of tutorials on:
<a href="https://tutorialspoint.com/Python" class="lang" id="link1">Python</a>,
<a href="https://tutorialspoint.com/Java" class="lang" id="link2">Java</a> and
<a href="https://tutorialspoint.com/PHP" class="lang" id="link3">PHP</a>;
Enhance your Programming skills.</p>

<p class="tutorial">...</p>
"""

soup = BeautifulSoup(html, 'html.parser')

tag = soup.find("a", id="link3")
print (tag.next_element)

tag = soup.find("a", id="link1")
print (tag.previous_element)

Output

PHP
TutorialsPoint has an excellent collection of tutorials on:

The next_element of the <a> tag with id="link3" is the string PHP, which is the tag's own content. Similarly, previous_element returns the string that comes just before the <a> tag with id="link1".
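To make the difference from the sibling attributes concrete, the following sketch compares .next_element and .next_sibling on the short two-tag snippet used in the Sideways section. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", "html.parser")

b_tag = soup.b
hello = b_tag.string   # the NavigableString inside <b>

print(repr(b_tag.next_element))  # the string inside <b>, parsed right after the tag
print(b_tag.next_sibling)        # the <i> tag, which shares the same parent
print(hello.next_element)        # the <i> tag, parsed right after the string
print(hello.next_sibling)        # None, the string is the only child of <b>
```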

.next_elements and .previous_elements

These attributes of the Tag object return generators over all the tags and strings that come after it and before it, respectively.

Example - Iterating Next elements

from bs4 import BeautifulSoup

html = """
<html><head><title>TutorialsPoint</title></head>
<body>
<p class="title"><b>Online Tutorials Library</b></p>

<p class="story">TutorialsPoint has an excellent collection of tutorials on:
<a href="https://tutorialspoint.com/Python" class="lang" id="link1">Python</a>,
<a href="https://tutorialspoint.com/Java" class="lang" id="link2">Java</a> and
<a href="https://tutorialspoint.com/PHP" class="lang" id="link3">PHP</a>;
Enhance your Programming skills.</p>

<p class="tutorial">...</p>
"""

soup = BeautifulSoup(html, 'html.parser')

tag = soup.find("a", id="link1")
for element in tag.next_elements:
   print (element)

Output

Python
,

<a class="lang" href="https://tutorialspoint.com/Java" id="link2">Java</a>
Java
 and

<a class="lang" href="https://tutorialspoint.com/PHP" id="link3">PHP</a>
PHP
;
Enhance your Programming skills.


<p class="tutorial">...</p>
...
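Since .next_elements is a generator that runs to the end of the document, it can be convenient to take only the first few items. A minimal sketch using itertools.islice from the standard library, on a shortened hypothetical paragraph:

```python
from itertools import islice

from bs4 import BeautifulSoup

# Shortened illustrative markup (an assumption, not the chapter's sample)
html = "<p><a id='link1'>Python</a>, <a id='link2'>Java</a> and <a id='link3'>PHP</a></p>"
soup = BeautifulSoup(html, "html.parser")

start = soup.find("a", id="link1")
# Take just the first three elements that follow the starting tag
first_three = list(islice(start.next_elements, 3))

for element in first_three:
    print(repr(element))
```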

Example - Iterating Previous elements

from bs4 import BeautifulSoup

html = """
<html><head><title>TutorialsPoint</title></head>
<body>
<p class="title"><b>Online Tutorials Library</b></p>

<p class="story">TutorialsPoint has an excellent collection of tutorials on:
<a href="https://tutorialspoint.com/Python" class="lang" id="link1">Python</a>,
<a href="https://tutorialspoint.com/Java" class="lang" id="link2">Java</a> and
<a href="https://tutorialspoint.com/PHP" class="lang" id="link3">PHP</a>;
Enhance your Programming skills.</p>

<p class="tutorial">...</p>
"""

soup = BeautifulSoup(html, 'html.parser')

tag = soup.find("body")
for element in tag.previous_elements:
   print (element)

Output

<html><head><title>TutorialsPoint</title></head>