Beautiful Soup - Searching the Tree



In this chapter, we shall discuss different methods in Beautiful Soup for navigating the HTML document tree in different directions - going up and down, sideways, and back and forth.

We shall use the following HTML string in all the examples in this chapter −

html = """
<html><head><title>TutorialsPoint</title></head>
   <body>
      <p class="title"><b>Online Tutorials Library</b></p>

      <p class="story">TutorialsPoint has an excellent collection of tutorials on:
      <a href="https://tutorialspoint.com/Python" class="lang" id="link1">Python</a>,
      <a href="https://tutorialspoint.com/Java" class="lang" id="link2">Java</a> and
      <a href="https://tutorialspoint.com/PHP" class="lang" id="link3">PHP</a>;
      Enhance your Programming skills.</p>

      <p class="tutorial">...</p>
"""

The name of required tag lets you navigate the parse tree. For example soup.head fetches you the <head> element −

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

print (soup.head.prettify())

Output

<head>
   <title>
      TutorialsPoint
   </title>
</head>

Going down

A tag may contain strings or other tags enclosed in it. The .contents property of Tag object returns a list of all the children elements belonging to it.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

tag = soup.head 
print (list(tag.children))

Output

[<title>TutorialsPoint</title>]

The returned object is a list, although in this case, there is only a single child tag enclosed in head element.

.children

The .children property also returns a list of all the enclosed elements in a tag. Below, all the elements in body tag are given as a list.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

tag = soup.body 
print (list(tag.children))

Output

['\n', <p class="title"><b>Online Tutorials Library</b></p>, '\n', 
<p class="story">TutorialsPoint has an excellent collection of tutorials on:
<a class="lang" href="https://tutorialspoint.com/Python" id="link1">Python</a>,
<a class="lang" href="https://tutorialspoint.com/Java" id="link2">Java</a> and
<a class="lang" href="https://tutorialspoint.com/PHP" id="link3">PHP</a>;
Enhance your Programming skills.</p>, '\n', <p class="tutorial">...</p>, '\n']

Instead of getting them as a list, you can iterate over a tag's children using the .children generator −

Example

tag = soup.body 
for child in tag.children:
   print (child)

Output

<p class="title"><b>Online Tutorials Library</b></p>
<p class="story">TutorialsPoint has an excellent collection of tutorials on:
<a class="lang" href="https://tutorialspoint.com/Python" id="link1">Python</a>,
<a class="lang" href="https://tutorialspoint.com/Java" id="link2">Java</a> and
<a class="lang" href="https://tutorialspoint.com/PHP" id="link3">PHP</a>;
Enhance your Programming skills.</p>


<p class="tutorial">...</p>

.descendents

The .contents and .children attributes only consider a tag's direct children. The .descendants attribute lets you iterate over all of a tag's children, recursively: its direct children, the children of its direct children, and so on.

The BeautifulSoup object is at the top of hierarchy of all the tags. Hence its .descendents property includes all the elements in the HTML string.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

print (soup.descendants)

The .descendents attribute returns a generator, which can be iterated with a for loop. Here, we list out the descendents of the head tag.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
tag = soup.head
for element in tag.descendants:
   print (element)

Output

<title>TutorialsPoint</title>
TutorialsPoint

The head tag contains a title tag, which in turn encloses a NavigableString object TutorialsPoint. The <head> tag has only one child, but it has two descendants: the <title> tag and the <title> tag's child. But the BeautifulSoup object only has one direct child (the <html> tag), but it has many descendants.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

tags = list(soup.descendants)
print (len(tags))

Output

27

Going Up

Just as you navigate the downstream of a document with children and descendents properties, BeautifulSoup offers .parent and .parent properties to navigate the upstream of a tag

.parent

every tag and every string has a parent tag that contains it. You can access an element's parent with the parent attribute. In our example, the <head> tag is the parent of the <title> tag.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

tag = soup.title
print (tag.parent)

Output

<head><title>TutorialsPoint</title></head>

Since the title tag contains a string (NavigableString), the parent for the string is title tag itself.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

tag = soup.title
string = tag.string
print (string.parent)

Output

<title>TutorialsPoint</title>

.parents

You can iterate over all of an element's parents with .parents. This example uses .parents to travel from an <a> tag buried deep within the document, to the very top of the document. In the following code, we track the parents of the first <a> tag in the example HTML string.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

tag = soup.a 
print (tag.string)

for parent in tag.parents:
   print (parent.name)

Output

Python
p
body
html
[document]

Sideways

The HTML tags appearing at the same indentation level are called siblings. Consider the following HTML snippet

<p>
   <b>
      Hello
   </b>
   <i>
      Python
   </i>
</p>

In the outer <p> tag, we have <b> and <i> tags at the same indent level, hence they are called siblings. BeautifulSoup makes it possible to navigate between the tags at same level.

.next_sibling and .previous_sibling

These attributes respectively return the next tag at the same level, and the previous tag at same level.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", 'html.parser')

tag1 = soup.b 
print ("next:",tag1.next_sibling)

tag2 = soup.i 
print ("previous:",tag2.previous_sibling)

Output

next: <i>Python</i>
previous: <b>Hello</b>

Since the <b> tag doesn't have a sibling to its left, and <i> tag doesn't have a sibling to its right, it returns Nobe in both cases.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", 'html.parser')

tag1 = soup.b 
print ("next:",tag1.previous_sibling)

tag2 = soup.i 
print ("previous:",tag2.next_sibling)

Output

next: None
previous: None

.next_siblings and .previous_siblings

If there are two or more siblings to the right or left of a tag, they can be navigated with the help of the .next_siblings and .previous_siblings attributes respectively. Both of them return generator object so that a for loop can be used to iterate.

Let us use the following HTML snippet for this purpose −

<p>
   <b>
      Excellent
   </b>
   <i>
      Python
   </i>
   <u>
      Tutorial
   </u>
</p>

Use the following code to traverse next and previous sibling tags.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>", 'html.parser')

tag1 = soup.b 
print ("next siblings:")
for tag in tag1.next_siblings:
   print (tag)
print ("previous siblings:")
tag2 = soup.u 
for tag in tag2.previous_siblings:
   print (tag)

Output

next siblings:
<i>Python</i>
<u>Tutorial</u>
previous siblings:
<i>Python</i>
<b>Excellent</b>

Back and forth

In Beautiful Soup, the next_element property returns the next string or tag in the parse tree. On the other hand, the previous_element property returns the previous string or tag in the parse tree. Sometimes, the return value of next_element and previous_element attributes is similar to next_sibling and previous_sibling properties.

.next_element and .previous_element

Example

html = """
<html><head><title>TutorialsPoint</title></head>
<body>
<p class="title"><b>Online Tutorials Library</b></p>

<p class="story">TutorialsPoint has an excellent collection of tutorials on:
<a href="https://tutorialspoint.com/Python" class="lang" id="link1">Python</a>,
<a href="https://tutorialspoint.com/Java" class="lang" id="link2">Java</a> and
<a href="https://tutorialspoint.com/PHP" class="lang" id="link3">PHP</a>;
Enhance your Programming skills.</p>

<p class="tutorial">...</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

tag = soup.find("a", id="link3")
print (tag.next_element)

tag = soup.find("a", id="link1")
print (tag.previous_element)

Output

PHP
TutorialsPoint has an excellent collection of tutorials on:

The next_element after <a> tag with id = "link3" is the string PHP. Similarly, the previous_element returns the string before <a> tag with id = "link1".

.next_elements and .previous_elements

These attributes of the Tag object return generator respectively of all tags and strings after and before it.

Next elements example

tag = soup.find("a", id="link1")
for element in tag.next_elements:
   print (element)

Output

Python
,

<a class="lang" href="https://tutorialspoint.com/Java" id="link2">Java</a>
Java
 and

<a class="lang" href="https://tutorialspoint.com/PHP" id="link3">PHP</a>
PHP
;
Enhance your Programming skills.


<p class="tutorial">...</p>
...

Previous elements example

tag = soup.find("body")
for element in tag.previous_elements:
   print (element)

Output

<html><head><title>TutorialsPoint</title></head>
Advertisements