Python Program to Extract Strings between HTML Tags

HTML tags define the structure of a web page, and the content we want to display is written as strings enclosed within those tags. The tags tell the browser how each string should be interpreted and displayed. Extracting these strings is therefore a common first step in web scraping, data manipulation, and text processing.

In this article, we will explore different methods to extract strings between HTML tags in Python, from simple string operations to a dedicated HTML parser.

Understanding the Problem

We have to extract all the strings between the HTML tags. The target strings are enclosed within different types of tags, and only the content should be retrieved, not the tags themselves. Let's understand this with the help of an example:

Input Output Scenario

Let us consider a string:

input_str = "<h1>This is a test string,</h1><p>Let's code together</p>"
print("Input:", input_str)

Input: <h1>This is a test string,</h1><p>Let's code together</p>

The expected output should contain only the content between the tags:

Output: ['This is a test string,', "Let's code together"]

As we can see, the <h1> and <p> tags are removed and the strings are extracted. Now that we have understood the problem, let's discuss different solutions.

Method 1: Using Iterations and replace()

This approach focuses on eliminating the HTML tags. We pass a string and a list of known HTML tags, then replace each tag with a space using the replace() method. Because the tags are simply blanked out, the text comes back as one combined string rather than as separate strings per tag:

input_str = "<h1>This is a test string,</h1><p>Let's code together</p>"
tags = ["<h1>", "</h1>", "<p>", "</p>", "<b>", "</b>", "<br>"]

print(f"Original string: {input_str}")
extracted_str = input_str

for tag in tags:
    if tag in extracted_str:
        extracted_str = extracted_str.replace(tag, " ")

# Strip leading/trailing spaces and wrap the string in a list
result = [extracted_str.strip()]
print(f"Extracted string: {result}")

Original string: <h1>This is a test string,</h1><p>Let's code together</p>
Extracted string: ["This is a test string,  Let's code together"]
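To get the two-element list from the problem statement with this approach, one option is to replace each tag with a separator instead of a space and then split on it. The following is a minimal sketch under two assumptions: the helper name extract_with_replace is hypothetical, and a newline is used as the separator, which only works when the text itself contains no newlines.

```python
def extract_with_replace(html, tags):
    """Swap each known tag for a newline separator, then split on it."""
    for tag in tags:
        html = html.replace(tag, "\n")
    # Adjacent tags leave empty fragments behind, so filter them out
    return [part.strip() for part in html.split("\n") if part.strip()]


input_str = "<h1>This is a test string,</h1><p>Let's code together</p>"
tags = ["<h1>", "</h1>", "<p>", "</p>", "<b>", "</b>", "<br>"]

result = extract_with_replace(input_str, tags)
print(f"Extracted strings: {result}")
```

This still depends on listing every tag in advance, so any tag missing from the list stays in the output.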

Method 2: Using Regex Module with findall()

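In this approach, we use Python's re module to match a pattern. For each tag name, we build a regular expression that matches the opening tag, lazily captures the content with (.*?), and then matches the closing tag. Note that the results are collected in the order of the tag list, not the order in which the tags appear in the string: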

import re

input_str = "<h1>This is a test string,</h1><p>Let's code together</p>"
tags = ["h1", "p", "b", "br"]

print(f"Original string: {input_str}")
extracted_strings = []

for tag in tags:
    pattern = f"<{tag}>(.*?)</{tag}>"
    matches = re.findall(pattern, input_str)
    extracted_strings.extend(matches)

print(f"Extracted strings: {extracted_strings}")

Original string: <h1>This is a test string,</h1><p>Let's code together</p>
Extracted strings: ['This is a test string,', "Let's code together"]
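The per-tag loop needs a tag list up front. As an alternative sketch, a single pattern with a backreference can match any tag name and return the contents in document order. The pattern below is an illustration rather than a general HTML matcher: it assumes well-formed markup and does not handle an element nested inside another element of the same name.

```python
import re

input_str = "<h1>This is a test string,</h1><p>Let's code together</p>"

# (\w+) captures the tag name, [^>]* skips any attributes,
# and \1 requires the closing tag to use the same captured name.
pattern = re.compile(r"<(\w+)[^>]*>(.*?)</\1>", re.DOTALL)

# findall() returns (tag_name, content) tuples; keep only the content
extracted_strings = [content for _tag, content in pattern.findall(input_str)]
print(f"Extracted strings: {extracted_strings}")
```

The re.DOTALL flag lets the captured content span multiple lines, which the per-tag pattern above would not match.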

Method 3: Using find() Method with String Slicing

In this approach, we locate an opening tag and its closing tag with the find() method and slice out the text between them. We iterate over each tag name in the list and repeat the search from the last closing tag until no further occurrence is found:

input_str = "<h1>This is a test string,</h1><p>Let's code together</p>"
tags = ["h1", "p", "b", "br"]
extracted_strings = []

print(f"Original string: {input_str}")

for tag in tags:
    start_tag = f"<{tag}>"
    end_tag = f"</{tag}>"
    
    start_pos = input_str.find(start_tag)
    while start_pos != -1:
        end_pos = input_str.find(end_tag, start_pos)
        if end_pos == -1:
            break
        
        # Extract content between tags
        content = input_str[start_pos + len(start_tag):end_pos]
        extracted_strings.append(content)
        
        # Find next occurrence
        start_pos = input_str.find(start_tag, end_pos)

print(f"Extracted strings: {extracted_strings}")

Original string: <h1>This is a test string,</h1><p>Let's code together</p>
Extracted strings: ['This is a test string,', "Let's code together"]
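The sample input contains each tag only once, so the while loop never runs a second time. The sketch below wraps the same find-and-slice logic in a hypothetical helper, extract_between, and runs it on an assumed input with a repeated <p> tag to show the loop advancing past each closing tag.

```python
def extract_between(html, tag):
    """Return the content of every <tag>...</tag> pair, left to right."""
    start_tag, end_tag = f"<{tag}>", f"</{tag}>"
    results = []
    start_pos = html.find(start_tag)
    while start_pos != -1:
        content_start = start_pos + len(start_tag)
        end_pos = html.find(end_tag, content_start)
        if end_pos == -1:
            break  # opening tag without a closing tag: stop searching
        results.append(html[content_start:end_pos])
        # Resume the search just after the closing tag
        start_pos = html.find(start_tag, end_pos + len(end_tag))
    return results


sample = "<p>first</p><b>bold</b><p>second</p>"
paragraphs = extract_between(sample, "p")
print(f"Extracted strings: {paragraphs}")
```

Like the original loop, this matches only the exact text of the tag, so an opening tag that carries attributes is not found.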

Method 4: Using BeautifulSoup (Recommended)

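For robust HTML parsing, the BeautifulSoup library (installed as the beautifulsoup4 package) provides a more reliable solution. It parses the markup into a tree, so there is no need to list tags in advance: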

from bs4 import BeautifulSoup

input_str = "<h1>This is a test string,</h1><p>Let's code together</p>"

soup = BeautifulSoup(input_str, 'html.parser')
extracted_strings = [tag.get_text() for tag in soup.find_all()]

print(f"Extracted strings: {extracted_strings}")

Extracted strings: ['This is a test string,', "Let's code together"]
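If installing a third-party package is not an option, a similar result can be sketched with the standard library's html.parser module. This is a different technique from BeautifulSoup, shown only as a dependency-free alternative. The class name TextCollector is hypothetical, and the sample input adds a nested <b> tag (not in the original example) to show that each text node is reported once.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect every non-empty text node in document order."""

    def __init__(self):
        super().__init__()
        self.strings = []

    def handle_data(self, data):
        # Called for the text found between tags
        text = data.strip()
        if text:
            self.strings.append(text)


input_str = "<h1>This is a test string,</h1><p>Let's code <b>together</b></p>"

collector = TextCollector()
collector.feed(input_str)
collector.close()
print(f"Extracted strings: {collector.strings}")
```

Here the nested markup yields three strings, because the text inside <b> is a separate text node from the text before it.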

Comparison of Methods

Method	Complexity	Accuracy	Best For
replace()	Low	Low	Simple, known tags
Regex	Medium	Medium	Pattern matching
find() + Slicing	Medium	Medium	Manual control
BeautifulSoup	Low	High	Complex HTML parsing

Conclusion

We explored multiple approaches to extract strings between HTML tags. For short, well-formed snippets with a known set of tags, the regex method with findall() offers a good balance of simplicity and accuracy. For real-world pages with nested tags, attributes, or malformed markup, the BeautifulSoup library gives more robust results and is the recommended choice.

Devesh Chauhan

Updated on: 2026-03-27T07:38:29+05:30
