Beautiful Soup - Functions Reference

Beautiful Soup Useful Resources

Beautiful Soup - Find All Headings



In this chapter, we shall explore how to find all heading elements in a HTML document with BeautifulSoup. HTML defines six heading styles from H1 to H6, each with decreasing font size. Suitable tags are used for different page sections, such as main heading, heading for section, topic etc. Let us use the find_all() method in two different ways to extract all the heading elements in a HTML document.

Example - Filtering out headings from all tags

In this approach, we collect all the tags in the parsed tree, and check if the name of each tag is found in a list of all heading tags.

from bs4 import BeautifulSoup

html = """
<html>
   <head>
      <title>BeautifulSoup - Scraping Headings</title>
   </head>
   <body>
      <h2>Scraping Headings</h2>
      <b>The quick, brown fox jumps over a lazy dog.</b>
      <h3>Paragraph Heading</h3>
      <p>DJs flock by when MTV ax quiz prog.</p>
      <h3>List heading</h3>
      <ul>
         <li>Junk MTV quiz graced by fox whelps.</li>
         <li>Bawds jog, flick quartz, vex nymphs.</li>
      </ul>
   </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

headings = ['h1','h2','h3', 'h4', 'h5', 'h6']
tags = soup.find_all()
heads = [(tag.name, tag.contents[0]) for tag in tags if tag.name in headings]
print (heads)

Here, headings is a list of all heading styles h1 to h6. If the name of a tag is any of these, the tag and its contents are collected in a lists named heads.

Output

[('h2', 'Scraping Headings'), ('h3', 'Paragraph Heading'), ('h3', 'List heading')]

Example - Getting Headings using Regex

You can pass a regex expression to the find_all() method. Take a look at the following regex.

re.compile('^h[1-6]$')

This regex finds all tags that start with h, have a digit after the h, and then end after the digit. Let use this as an argument to find_all() method in the code below −

from bs4 import BeautifulSoup
import re

html = """
<html>
   <head>
      <title>BeautifulSoup - Scraping Headings</title>
   </head>
   <body>
      <h2>Scraping Headings</h2>
      <b>The quick, brown fox jumps over a lazy dog.</b>
      <h3>Paragraph Heading</h3>
      <p>DJs flock by when MTV ax quiz prog.</p>
      <h3>List heading</h3>
      <ul>
         <li>Junk MTV quiz graced by fox whelps.</li>
         <li>Bawds jog, flick quartz, vex nymphs.</li>
      </ul>
   </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

tags = soup.find_all(re.compile('^h[1-6]$'))
print (tags)

Output

[<h2>Scraping Headings</h2>, <h3>Paragraph Heading</h3>, <h3>List heading</h3>]
Advertisements