Beautiful Soup - Parsing only section of a document



There are multiple situations where you want to extract specific types of information (only <a> tags) using Beautifulsoup4. The SoupStrainer class in Beautifulsoup allows you to parse only specific part of an incoming document.

One way is to create a SoupStrainer and pass it on to the Beautifulsoup4 constructor as the parse_only argument.

SoupStrainer

A SoupStrainer tells BeautifulSoup what parts extract, and the parse tree consists of only these elements. If you narrow down your required information to a specific portion of the HTML, this will speed up your search result.

product = SoupStrainer('div',{'id': 'products_list'})
soup = BeautifulSoup(html,parse_only=product)

Above lines of code will parse only the titles from a product site, which might be inside a tag field.

Similarly, like above we can use other soupStrainer objects, to parse specific information from an HTML tag. Below are some of the examples −

from bs4 import BeautifulSoup, SoupStrainer

#Only "a" tags
only_a_tags = SoupStrainer("a")

#Will parse only the below mentioned "ids".
parse_only = SoupStrainer(id=["first", "third", "my_unique_id"])
soup = BeautifulSoup(my_document, "html.parser", parse_only=parse_only)

#parse only where string length is less than 10
def is_short_string(string):
   return len(string) < 10
   
only_short_strings =SoupStrainer(string=is_short_string)
Advertisements