Beautiful Soup - Find All Comments



Adding comments to code is considered good programming practice. Comments help readers understand the logic of a program, and they also serve as documentation. You can put comments in HTML and XML documents, just as in a program written in C, Java, Python, etc. The Beautiful Soup API can be used to find all the comments in an HTML document.

In HTML and XML, the comment text is written between the <!-- and --> delimiters.

<!-- Comment Text -->

The Beautiful Soup package, which is imported as bs4, defines a Comment class. Comment is a special type of NavigableString. Hence, when the only content of a Tag is text written between <!-- and -->, the string property of that Tag returns a Comment object.

Example - Extracting comments

from bs4 import BeautifulSoup
markup = "<b><!--This is a comment text in HTML--></b>"
soup = BeautifulSoup(markup, 'html.parser')

comment = soup.b.string
print(comment, type(comment))

Output

This is a comment text in HTML <class 'bs4.element.Comment'>
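
The relationship between Comment and NavigableString described above can be checked directly. The following is a minimal sketch, not part of the original example; the second markup string (mixed) is a hypothetical input added only to show that the string property returns None when a tag has more than one child.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

markup = "<b><!--This is a comment text in HTML--></b>"
soup = BeautifulSoup(markup, "html.parser")
comment = soup.b.string

# Comment subclasses NavigableString, which in turn subclasses str
is_comment = isinstance(comment, Comment)
is_navigable = isinstance(comment, NavigableString)
is_str = isinstance(comment, str)
print(is_comment, is_navigable, is_str)

# Hypothetical markup: <b> holds a text node and a comment,
# so .string cannot pick a single child and returns None
mixed = BeautifulSoup("<b>text<!--note--></b>", "html.parser")
print(mixed.b.string)
```

Because the string property only works for a tag with a single child, it is not a reliable way to collect comments from a whole document.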

To search for all the occurrences of comments in an HTML document, we use the find_all() method. Without any argument, find_all() returns all the tags in the parsed document. You can pass the keyword argument string to find_all() to match strings instead. Here we pass the iscomment() function itself (not its return value) as that argument, so that find_all() calls the function on each string and keeps those for which it returns True.

comments = soup.find_all(string=iscomment)

The iscomment() function checks whether the element passed to it is a Comment object, with the help of the built-in isinstance() function.

from bs4 import Comment

def iscomment(elem):
    return isinstance(elem, Comment)

The comments variable stores all the comments found in the HTML content.
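
For a one-off search, the same check can be written inline as a lambda instead of a named function. This is a minimal sketch of that variation; the markup string is a hypothetical input chosen to mix comments with ordinary text.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup: two comments and one ordinary text node
markup = "<p><!-- first --></p><p>plain text<!-- second --></p>"
soup = BeautifulSoup(markup, "html.parser")

# The lambda is called on each string; only Comment objects are kept
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
print(comments)
```

The ordinary text node is skipped because it is a NavigableString but not a Comment.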

The following Python program parses the HTML content and finds all the comments in it.

Example - Getting all comments

from bs4 import BeautifulSoup, Comment

html = """
<html>
   <head>
      <!-- Title of document -->
      <title>TutorialsPoint</title>
   </head>
   <body>
      <!-- Page heading -->
      <h2>Departmentwise Employees</h2>
      <!-- top level list-->
      <ul id="dept">
      <li>Accounts</li>
         <ul id='acc'>
         <!-- first inner list -->
         <li>Anand</li>
         <li>Mahesh</li>
         </ul>
      <li>HR</li>
         <ul id="HR">
         <!-- second inner list -->
         <li>Rani</li>
         <li>Ankita</li>
         </ul>
      </ul>
   </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

def iscomment(elem):
    return isinstance(elem, Comment)

comments = soup.find_all(string=iscomment)
print(comments)

Output

[' Title of document ', ' Page heading ', ' top level list', ' first inner list ', ' second inner list ']

The output above is a list of all the comments. We can also iterate over this list with a for loop.

Example - Looping over Collection of Comments

from bs4 import BeautifulSoup, Comment

html = """
<html>
   <head>
      <!-- Title of document -->
      <title>TutorialsPoint</title>
   </head>
   <body>
      <!-- Page heading -->
      <h2>Departmentwise Employees</h2>
      <!-- top level list-->
      <ul id="dept">
      <li>Accounts</li>
         <ul id='acc'>
         <!-- first inner list -->
         <li>Anand</li>
         <li>Mahesh</li>
         </ul>
      <li>HR</li>
         <ul id="HR">
         <!-- second inner list -->
         <li>Rani</li>
         <li>Ankita</li>
         </ul>
      </ul>
   </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

def iscomment(elem):
    return isinstance(elem, Comment)

comments = soup.find_all(string=iscomment)

# enumerate() supplies the running count, starting from 1
for i, comment in enumerate(comments, start=1):
    print(i, ".", comment)

Output

1 .  Title of document 
2 .  Page heading
3 .  top level list
4 .  first inner list
5 .  second inner list
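
The comment strings above keep the spaces that surround them in the markup. As a rough sketch, assuming you want clean text and some context for each comment, the standard str.strip() method and the parent attribute can be used. The shortened html string and the cleaned and parent_ids names are illustrative, not part of the original program.

```python
from bs4 import BeautifulSoup, Comment

# Shortened, hypothetical version of the employee list markup
html = """
<ul id="dept">
   <!-- top level list-->
   <li>Accounts</li>
   <ul id="acc">
      <!-- first inner list -->
      <li>Anand</li>
   </ul>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

# strip() removes the leading and trailing spaces of each comment
cleaned = [comment.strip() for comment in comments]

# parent is the Tag that directly contains the comment
parent_ids = [comment.parent.get("id") for comment in comments]

for number, (text, parent_id) in enumerate(zip(cleaned, parent_ids), start=1):
    print(f"{number}. {text} (inside ul#{parent_id})")
```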

In this chapter, we learned how to extract all the comment strings in an HTML document.
