How to remove empty tags using BeautifulSoup in Python?
BeautifulSoup is a Python library for extracting data from HTML and XML files. It can also be used to remove empty tags from HTML or XML documents, leaving cleaner, more readable markup.
First, install the BeautifulSoup library using: pip install beautifulsoup4
Basic Example − Removing Empty Tags
Here's how to identify and remove empty tags from an HTML document:
from bs4 import BeautifulSoup
# HTML document with empty tags
html_document = """
<html>
<body>
<p>Python is an interpreted, high-level programming language.</p>
<div></div>
<span> </span>
<p>Python emphasizes code readability.</p>
<strong></strong>
</body>
</html>
"""
# Create BeautifulSoup object
soup = BeautifulSoup(html_document, "html.parser")
# Remove empty tags
for tag in soup.find_all():
    if len(tag.get_text(strip=True)) == 0:
        tag.extract()
print(soup.prettify())
<html>
 <body>
  <p>
   Python is an interpreted, high-level programming language.
  </p>
  <p>
   Python emphasizes code readability.
  </p>
 </body>
</html>
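One caveat with the loop above: tags such as `<img>`, `<br>`, and `<hr>` never contain text, so the `get_text(strip=True)` test treats them as empty and removes them too, along with any wrapper that holds only an image. The following is a minimal sketch of one way to guard against that. The `KEEP` list and the `is_removable` helper are illustrative names, not part of BeautifulSoup, and the choice of tags to keep is an assumption to adjust for your own documents.

```python
from bs4 import BeautifulSoup

# Hypothetical allow-list (an assumption): tags that are meaningful without text.
KEEP = ["img", "br", "hr", "input"]

def is_removable(tag):
    """Return True if the tag has no text and holds nothing from KEEP."""
    if tag.name in KEEP:
        return False
    # Keep wrappers that contain a protected tag, e.g. <p><img ...></p>.
    if tag.find(KEEP) is not None:
        return False
    return not tag.get_text(strip=True)

html = """
<div>
<p>Logo: <img src="logo.png"></p>
<p><img src="photo.png"></p>
<p></p>
<span> </span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

for tag in soup.find_all():
    if is_removable(tag):
        tag.extract()

print(soup.prettify())
```

Here both images and their enclosing paragraphs survive, while the truly empty `<p>` and the whitespace-only `<span>` are removed.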
Handling Tags with Only Whitespace
The strip=True parameter ensures tags containing only whitespace are also removed:
from bs4 import BeautifulSoup
html_content = """
<div>
<p>Valid content here</p>
<span> </span>
<em></em>
<strong>Bold text</strong>
</div>
"""
soup = BeautifulSoup(html_content, "html.parser")
print("Before removing empty tags:")
print(soup.prettify())
# Remove empty tags including whitespace-only tags
for tag in soup.find_all():
    if not tag.get_text(strip=True):
        tag.extract()
print("\nAfter removing empty tags:")
print(soup.prettify())
Before removing empty tags:
<div>
 <p>
  Valid content here
 </p>
 <span>
 </span>
 <em>
 </em>
 <strong>
  Bold text
 </strong>
</div>

After removing empty tags:
<div>
 <p>
  Valid content here
 </p>
 <strong>
  Bold text
 </strong>
</div>
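Because `get_text()` collects text from all descendants, the same check also catches wrapper tags whose children are all empty. In addition, Python's `str.strip()` treats a non-breaking space (`&nbsp;`, U+00A0) as whitespace, so a tag holding only `&nbsp;` counts as empty. A short sketch illustrating both cases, using made-up markup:

```python
from bs4 import BeautifulSoup

html = """
<section>
<div><span></span><em> </em></div>
<p>&nbsp;</p>
<p>Kept</p>
</section>
"""

soup = BeautifulSoup(html, "html.parser")

# The parser decodes &nbsp; to the character U+00A0.
nbsp_paragraph = soup.find("p")
print(repr(nbsp_paragraph.get_text()))            # '\xa0'
print(repr(nbsp_paragraph.get_text(strip=True)))  # ''

# The <div> has no text of its own or in its children, so it is removed as well.
for tag in soup.find_all():
    if not tag.get_text(strip=True):
        tag.extract()

remaining = [tag.name for tag in soup.find_all()]
print(remaining)  # ['section', 'p']
```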
Removing Specific Empty Tags
You can target specific tag types instead of all tags:
from bs4 import BeautifulSoup
html_data = """
<html>
<body>
<p>Content paragraph</p>
<p></p>
<div>Valid div</div>
<div></div>
<span>Text content</span>
<span></span>
</body>
</html>
"""
soup = BeautifulSoup(html_data, "html.parser")
# Remove only empty div and p tags
for tag in soup.find_all(['div', 'p']):
    if not tag.get_text(strip=True):
        tag.extract()
print(soup.prettify())
<html>
 <body>
  <p>
   Content paragraph
  </p>
  <div>
   Valid div
  </div>
  <span>
   Text content
  </span>
  <span>
  </span>
 </body>
</html>
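If the same cleanup is needed in several places, the pattern can be wrapped in a small function. This is only a sketch: `remove_empty_tags` and its `names` parameter are hypothetical names chosen for illustration, not a BeautifulSoup API.

```python
from bs4 import BeautifulSoup

def remove_empty_tags(soup, names=None):
    """Remove tags without text; restrict to `names` if given. Returns the count removed."""
    # find_all(True) matches every tag; a list limits the search to those tag names.
    candidates = soup.find_all(names if names else True)
    removed = 0
    for tag in candidates:
        if not tag.get_text(strip=True):
            tag.extract()
            removed += 1
    return removed

html = "<body><p>Text</p><p></p><div></div><span></span></body>"
soup = BeautifulSoup(html, "html.parser")

# Only empty <div> and <p> tags are removed; the empty <span> is left alone.
removed = remove_empty_tags(soup, names=["div", "p"])
print(removed)  # 2
print(soup)     # <body><p>Text</p><span></span></body>
```

Note that the count includes empty tags nested inside other removed tags, so it is a tally of matches rather than of top-level removals.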
Key Points
- get_text(strip=True) strips leading and trailing whitespace from the tag's text before the emptiness check
- extract() removes the tag, together with its children, from the document
- Use find_all() to iterate through all tags, or pass a list of tag names to target particular ones
- The parser ("html.parser", "lxml") affects how BeautifulSoup handles the document
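A related detail: `extract()` returns the element it removed, so removed tags can be inspected or reinserted elsewhere, whereas `decompose()` is the alternative when a tag is simply to be discarded. A small sketch with invented markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>Keep</p><em></em><b></b></div>", "html.parser")

# extract() detaches the tag from the tree and hands it back.
removed = soup.em.extract()
print(removed.name)    # em
print(removed.parent)  # None

# decompose() detaches the tag and destroys it; it returns None.
soup.b.decompose()

print(soup)  # <div><p>Keep</p></div>
```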
Conclusion
BeautifulSoup makes it easy to remove empty tags using get_text(strip=True) to identify empty content and extract() to remove unwanted tags. This helps clean up HTML documents by removing unnecessary markup.
