Python Program to Extract Strings between HTML Tags


HTML tags define the skeleton of a web page. Content is placed as strings enclosed within these tags, and the tags determine how the browser displays and interprets that content. Extracting these strings is therefore a common step in data manipulation and processing, and it also helps us analyse the structure of an HTML document.

These strings carry the actual content of a webpage. In this article, we will work with these strings. Our task is to extract the strings between HTML tags.

Understanding the Problem

We have to extract all the strings between the HTML tags. Our target strings are enclosed within different types of tags and only the content part should be retrieved. Let’s understand this with the help of an example.

Input Output Scenario

Let us consider a string −

Input:
Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"

The input string consists of different HTML tags and we have to extract the string between them.

Output: [" This is a test string,  Let's code together "]

As we can see, the “<h1>” and “<p>” tags are removed and the strings are extracted. Now that we have understood the problem, let’s discuss a few solutions.

Using Iterations and Replace()

This approach removes the HTML tags by replacing them. We start with the input string and a list of the HTML tags we expect to find, and we store the string as the only element of a list.

We iterate over each element in the tag list and check whether it exists in the original string. A “pos” variable stores the index of the string within the list; since the list holds a single string, it is 0 throughout the loop.

We replace each tag with a single space using the “replace()” method and obtain a string that is free of HTML tags. Because every tag becomes a space, the result contains leading, trailing and doubled spaces.

Example

Following is an example to extract strings between HTML tags −

Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"
tags = ["<h1>", "</h1>", "<p>", "</p>", "<b>", "</b>", "<br>"]
print(f"This is the original string: {Inp_STR}")
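ExStr = [Inp_STR]
# The list holds a single string, so its index stays 0
pos = 0

for tag in tags:
   # Replace each tag that is present with a single space
   if tag in ExStr[pos]:
      ExStr[pos] = ExStr[pos].replace(tag, " ")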

print(f"The extracted string is : {ExStr}")

Output

This is the original string: <h1>This is a test string,</h1><p>Let's code together</p>
The extracted string is : [" This is a test string,  Let's code together "]
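
The output above keeps the spaces that replaced each tag. As a minimal sketch, assuming a single clean string is wanted, the same replace() loop can be wrapped in a helper and the extra whitespace collapsed afterwards. The function name strip_tags is hypothetical and is not part of the original example.

```python
def strip_tags(text, tags):
    """Replace each listed tag with a space, then collapse extra whitespace."""
    for tag in tags:
        text = text.replace(tag, " ")
    # split() with no arguments splits on any run of whitespace and
    # drops empty pieces, so joining with " " normalises the spacing
    return " ".join(text.split())


Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"
tags = ["<h1>", "</h1>", "<p>", "</p>", "<b>", "</b>", "<br>"]
print(strip_tags(Inp_STR, tags))
```

Tags that are not in the list, or tags that carry attributes, are left untouched by this sketch.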

Using Regex Module + Findall()

In this approach, we use the “re” module to match a pattern. We build the regular expression “<"+tag+">(.*?)</"+tag+">”, which matches an opening tag, lazily captures the content that follows, and stops at the first matching closing tag. Here, “tag” is a variable that takes its value from a list of tag names as we iterate over it.

The “findall()” function returns all matches of the pattern in the original string; because the pattern contains one capturing group, it returns only the captured content. We use the “extend()” method to add these matches to a new list. In this manner we extract the strings enclosed within the HTML tags.
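
The “?” in “(.*?)” matters when a tag occurs more than once. The short sketch below compares the greedy and lazy forms; the sample string is made up for illustration.

```python
import re

# Hypothetical sample with the same tag appearing twice
sample = "<p>first</p><p>second</p>"

# Greedy: .* runs to the last </p>, swallowing the tags in between
greedy = re.findall("<p>(.*)</p>", sample)
# Lazy: .*? stops at the first </p> it can reach
lazy = re.findall("<p>(.*?)</p>", sample)

print(greedy)  # ['first</p><p>second']
print(lazy)    # ['first', 'second']
```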

Example

Following is an example −

import re
Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"
tags = ["h1", "p", "b", "br"]
print(f"This is the original string: {Inp_STR}")
ExStr = []

for tag in tags:
   # Build a pattern such as <h1>(.*?)</h1>; the group captures the content
   seq = "<"+tag+">(.*?)</"+tag+">"
   matches = re.findall(seq, Inp_STR)
   ExStr.extend(matches)
print(f"The extracted string is: {ExStr}")

Output

This is the original string: <h1>This is a test string,</h1><p>Let's code together</p>
The extracted string is: ['This is a test string,', "Let's code together"]
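
The loop above returns results grouped by the order of the tag list, not by their order in the document, and it only matches tags without attributes. As a sketch of one possible variation (not the method used in this article), a single pattern with a backreference can match any tag name and pair it with its own closing tag. The pattern and the variable name pairs are assumptions made for illustration.

```python
import re

Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"

# (\w+)  captures the tag name, [^>]* allows optional attributes,
# (.*?)  lazily captures the content, and \1 requires the same
# tag name in the closing tag
pattern = r"<(\w+)[^>]*>(.*?)</\1>"

# With two groups, findall() returns (tag, content) tuples in document order
pairs = re.findall(pattern, Inp_STR)
print(pairs)
```

This sketch does not handle nested tags of the same name, and “.” does not match newlines unless re.DOTALL is passed.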

Using Iteration and Find()

In this approach, we locate the first occurrence of the opening tag and of the closing tag in the original string with the help of the “find()” method. We iterate over each element in the tag list and retrieve these positions.

A while loop continues the search for as long as an opening tag is found. If an opening tag has no closing tag after it, the tag is incomplete and the loop stops. Otherwise the index is updated on each iteration to find the next occurrence of the opening and closing tags.

Once the positions of an opening tag and its closing tag are known, we use string slicing to extract the string between them and append it to the result list.

Example

Following is an example −

Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"
tags = ["h1", "p", "b", "br"]
ExStr = []
print(f"The original string is: {Inp_STR}")

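for tag in tags:
   # Position of the first opening tag, or -1 if the tag is absent
   tagpos1 = Inp_STR.find("<"+tag+">")
   while tagpos1 != -1:
      # Look for the closing tag after the opening tag
      tagpos2 = Inp_STR.find("</"+tag+">", tagpos1)
      if tagpos2 == -1:
         # Opening tag without a closing tag: stop searching for this tag
         break
      # Skip "<", the tag name and ">" to reach the start of the content
      ExStr.append(Inp_STR[tagpos1 + len(tag)+2: tagpos2])
      tagpos1 = Inp_STR.find("<"+tag+">", tagpos2)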

print(f"The extracted string is: {ExStr}")

Output

The original string is: <h1>This is a test string,</h1><p>Let's code together</p>
The extracted string is: ['This is a test string,', "Let's code together"]
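
All three approaches depend on a fixed list of tags. As a hedged alternative sketch, and not one of the methods covered above, the standard library's html.parser module can collect the text without listing any tags. The class name TextCollector and its chunks attribute are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper that stores every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by HTMLParser for the text found between tags
        if data.strip():
            self.chunks.append(data)


Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"
parser = TextCollector()
parser.feed(Inp_STR)
parser.close()  # flush any text still buffered
print(parser.chunks)
```

Because the parser reads the tags itself, tags with attributes and tags that were not anticipated are handled without changing the code.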

Conclusion

In this article, we discussed three ways to extract strings between HTML tags. We began with the simpler solution of locating the tags and replacing them with spaces. We then used the regex module and its findall() function to find the matches of a pattern. Finally, we applied the find() method together with string slicing.

Updated on: 12-Jul-2023
