Beautiful Soup - Navigating by Tags


Advertisements

In this chapter, we shall discuss about Navigating by Tags.

Below is our html document −

>>> html_doc = """
<html><head><title>Tutorials Point</title></head>
<body>
<p class="title"><b>The Biggest Online Tutorials Library, It's all Free</b></p>
<p class="prog">Top 5 most used Programming Languages are:
<a href="https://www.tutorialspoint.com/java/java_overview.htm" class="prog" id="link1">Java</a>,
<a href="https://www.tutorialspoint.com/cprogramming/index.htm" class="prog" id="link2">C</a>,
<a href="https://www.tutorialspoint.com/python/index.htm" class="prog" id="link3">Python</a>,
<a href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" class="prog" id="link4">JavaScript</a> and
<a href="https://www.tutorialspoint.com/ruby/index.htm" class="prog" id="link5">C</a>;
as per online survey.</p>
<p class="prog">Programming Languages</p>
"""
>>>
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, 'html.parser')
>>>

Based on the above document, we will try to move from one part of document to another.

Going down

One of the important pieces of element in any piece of HTML document are tags, which may contain other tags/strings (tag’s children). Beautiful Soup provides different ways to navigate and iterate over’s tag’s children.

Navigating using tag names

Easiest way to search a parse tree is to search the tag by its name. If you want the <head> tag, use soup.head −

>>> soup.head
<head>&t;title>Tutorials Point</title></head>
>>> soup.title
<title>Tutorials Point</title>

To get specific tag (like first <b> tag) in the <body> tag.

>>> soup.body.b
<b>The Biggest Online Tutorials Library, It's all Free</b>

Using a tag name as an attribute will give you only the first tag by that name −

>>> soup.a
<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>

To get all the tag’s attribute, you can use find_all() method −

>>> soup.find_all("a")
[<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>, <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>, <a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>, <a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a>, <a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>]>>> soup.find_all("a")
[<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>, <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>, <a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>, <a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a>, <a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>]

.contents and .children

We can search tag’s children in a list by its .contents −

>>> head_tag = soup.head
>>> head_tag
<head><title>Tutorials Point</title></head>
>>> Htag = soup.head
>>> Htag
<head><title>Tutorials Point</title></head>
>>>
>>> Htag.contents
[<title>Tutorials Point</title>
>>>
>>> Ttag = head_tag.contents[0]
>>> Ttag
<title>Tutorials Point</title>
>>> Ttag.contents
['Tutorials Point']

The BeautifulSoup object itself has children. In this case, the <html> tag is the child of the BeautifulSoup object −

>>> len(soup.contents)
2
>>> soup.contents[1].name
'html'

A string does not have .contents, because it can’t contain anything −

>>> text = Ttag.contents[0]
>>> text.contents
self.__class__.__name__, attr))
AttributeError: 'NavigableString' object has no attribute 'contents'

Instead of getting them as a list, use .children generator to access tag’s children −

>>> for child in Ttag.children:
print(child)
Tutorials Point

.descendants

The .descendants attribute allows you to iterate over all of a tag’s children, recursively −

its direct children and the children of its direct children and so on −

>>> for child in Htag.descendants:
print(child)
<title>Tutorials Point</title>
Tutorials Point

The <head> tag has only one child, but it has two descendants: the <title> tag and the <title> tag’s child. The beautifulsoup object has only one direct child (the <html> tag), but it has a whole lot of descendants −

>>> len(list(soup.children))
2
>>> len(list(soup.descendants))
33

.string

If the tag has only one child, and that child is a NavigableString, the child is made available as .string −

>>> Ttag.string
'Tutorials Point'

If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child −

>>> Htag.contents
[<title>Tutorials Point</title>]
>>>
>>> Htag.string
'Tutorials Point'

However, if a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to None −

>>> print(soup.html.string)
None

.strings and stripped_strings

If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator −

>>> for string in soup.strings:
print(repr(string))
'\n'
'Tutorials Point'
'\n'
'\n'
"The Biggest Online Tutorials Library, It's all Free"
'\n'
'Top 5 most used Programming Languages are: \n'
'Java'
',\n'
'C'
',\n'
'Python'
',\n'
'JavaScript'
' and\n'
'C'
';\n \nas per online survey.'
'\n'
'Programming Languages'
'\n'

To remove extra whitespace, use .stripped_strings generator −

>>> for string in soup.stripped_strings:
print(repr(string))
'Tutorials Point'
"The Biggest Online Tutorials Library, It's all Free"
'Top 5 most used Programming Languages are:'
'Java'
','
'C'
','
'Python'
','
'JavaScript'
'and'
'C'
';\n \nas per online survey.'
'Programming Languages'

Going up

In a “family tree” analogy, every tag and every string has a parent: the tag that contain it:

.parent

To access the element’s parent element, use .parent attribute.

>>> Ttag = soup.title
>>> Ttag
<title>Tutorials Point</title>
>>> Ttag.parent
<head>title>Tutorials Point</title></head>

In our html_doc, the title string itself has a parent: the <title> tag that contain it−

>>> Ttag.string.parent
<title>Tutorials Point</title>

The parent of a top-level tag like <html> is the Beautifulsoup object itself −

>>> htmltag = soup.html
>>> type(htmltag.parent)
<class 'bs4.BeautifulSoup'>

The .parent of a Beautifulsoup object is defined as None −

>>> print(soup.parent)
None

.parents

To iterate over all the parents elements, use .parents attribute.

>>> link = soup.a
>>> link
<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>
>>>
>>> for parent in link.parents:
if parent is None:
print(parent)
else:
print(parent.name)
p
body
html
[document]

Going sideways

Below is one simple document −

>>> sibling_soup = BeautifulSoup("<a><b>TutorialsPoint</b><c><strong>The Biggest Online Tutorials Library, It's all Free</strong></b></a>")
>>> print(sibling_soup.prettify())
<html>
<body>
   <a>
      <b>
         TutorialsPoint
      </b>
      <c>
         <strong>
            The Biggest Online Tutorials Library, It's all Free
         </strong>
      </c>
   </a>
</body>
</html>

In the above doc, <b> and <c> tag is at the same level and they are both children of the same tag. Both <b> and <c> tag are siblings.

.next_sibling and .previous_sibling

Use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree:

>>> sibling_soup.b.next_sibling
<c><strong>The Biggest Online Tutorials Library, It's all Free</strong></c>
>>>
>>> sibling_soup.c.previous_sibling
<b>TutorialsPoint</b>

The <b> tag has a .next_sibling but no .previous_sibling, as there is nothing before the <b> tag on the same level of the tree, same case is with <c> tag.

>>> print(sibling_soup.b.previous_sibling)
None
>>> print(sibling_soup.c.next_sibling)
None

The two strings are not siblings, as they don’t have the same parent.

>>> sibling_soup.b.string
'TutorialsPoint'
>>>
>>> print(sibling_soup.b.string.next_sibling)
None

.next_siblings and .previous_siblings

To iterate over a tag’s siblings use .next_siblings and .previous_siblings.

>>> for sibling in soup.a.next_siblings:
print(repr(sibling))
',\n'
<a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>
',\n'
>a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>
',\n'
<a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a>
' and\n'
<a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm"
id="link5">C</a>
';\n \nas per online survey.'
>>> for sibling in soup.find(id="link3").previous_siblings:
print(repr(sibling))
',\n'
<a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>
',\n'
<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>
'Top 5 most used Programming Languages are: \n'

Going back and forth

Now let us get back to first two lines in our previous “html_doc” example −

&t;html><head><title>Tutorials Point</title></head>
<body>
<h4 class="tagLine"><b>The Biggest Online Tutorials Library, It's all Free</b></h4>

An HTML parser takes above string of characters and turns it into a series of events like “open an <html> tag”, “open an <head> tag”, “open the <title> tag”, “add a string”, “close the </title> tag”, “close the </head> tag”, “open a <h4> tag” and so on. BeautifulSoup offers different methods to reconstructs the initial parse of the document.

.next_element and .previous_element

The .next_element attribute of a tag or string points to whatever was parsed immediately afterwards. Sometimes it looks similar to .next_sibling, however it is not same entirely. Below is the final <a> tag in our “html_doc” example document.

>>> last_a_tag = soup.find("a", id="link5")
>>> last_a_tag
<a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>
>>> last_a_tag.next_sibling
';\n \nas per online survey.'

However the .next_element of that <a> tag, the thing that was parsed immediately after the <a> tag, is not the rest of that sentence: it is the word “C”:

>>> last_a_tag.next_element
'C'

Above behavior is because in the original markup, the letter “C” appeared before that semicolon. The parser encountered an <a> tag, then the letter “C”, then the closing </a> tag, then the semicolon and rest of the sentence. The semicolon is on the same level as the <a> tag, but the letter “C” was encountered first.

The .previous_element attribute is the exact opposite of .next_element. It points to whatever element was parsed immediately before this one.

>>> last_a_tag.previous_element
' and\n'
>>>
>>> last_a_tag.previous_element.next_element
<a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>

.next_elements and .previous_elements

We use these iterators to move forward and backward to an element.

>>> for element in last_a_tag.next_e lements:
print(repr(element))
'C'
';\n \nas per online survey.'
'\n'
<p class="prog">Programming Languages</p>
'Programming Languages'
'\n'
Advertisements