How to use Python regular expressions to extract a URL from an HTML link?


URL is an acronym for Uniform Resource Locator; it identifies the location of a resource on the internet. For example, the following URLs identify the locations of the Google and Microsoft websites −

https://www.google.com
https://www.microsoft.com

A URL consists of a protocol, a domain name, an optional port number, a path and so on. A URL can be parsed and processed with regular expressions, which Python provides through its built-in re module.

Example

The following example shows a URL and the parts we can extract from it −

URL: https://www.tutorialspoint.com/courses
If we parse the above URL, we can find the website name and the protocol:
Hostname (without the leading "www."): tutorialspoint.com
Protocol: https

Regular Expressions

In Python, a regular expression is a search pattern used to find matching strings.

The re module provides several functions for working with regular expressions. Four of the most commonly used are −

  • search() − It finds the first match anywhere in the string.

  • match() − It finds a match only at the beginning of the string.

  • findall() − It finds all non-overlapping matches.

  • sub() − It substitutes every part of the string that matches the pattern with a new string.
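The difference between these four functions is easiest to see side by side. The following is a minimal sketch; the sample sentence and the simplified pattern `https?://[\w.-]+` are assumptions chosen for illustration and only cover the protocol and host part of a URL.

```python
import re

# Sample text and a deliberately simple pattern (both are illustrative assumptions)
text = "Visit https://www.google.com or https://www.microsoft.com today"
pattern = r"https?://[\w.-]+"

# search(): first match anywhere in the string
first = re.search(pattern, text)
print(first.group())      # https://www.google.com

# match(): succeeds only if the pattern matches at the start of the string
at_start = re.match(pattern, text)
print(at_start)           # None, because the text starts with "Visit"

# findall(): every non-overlapping match, returned as a list
all_urls = re.findall(pattern, text)
print(all_urls)           # ['https://www.google.com', 'https://www.microsoft.com']

# sub(): replace every match with a new string
masked = re.sub(pattern, "<URL>", text)
print(masked)             # Visit <URL> or <URL> today
```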

To find every occurrence of a pattern in a string, such as every URL in a piece of HTML, we use the re.findall() function of the re module.

Syntax

Following is the syntax of the re.findall() function in Python −

re.findall(regex, string)

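re.findall() returns all non-overlapping matches of the pattern in the string as a list of strings. If the pattern contains one capture group, the list holds the text of that group instead of the whole match; if it contains several groups, the list holds tuples with one entry per group.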
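How the return value of re.findall() changes with the number of capture groups can be sketched as follows. The sample text and the simplified host pattern are assumptions made for this illustration.

```python
import re

# Illustrative input containing two URLs
text = "http://tutorialspoint.com and https://www.tutorialspoint.com/market"

# No capture group: each item is the full match
no_groups = re.findall(r"https?://[\w.-]+", text)
print(no_groups)    # ['http://tutorialspoint.com', 'https://www.tutorialspoint.com']

# One capture group: each item is only the text of that group
one_group = re.findall(r"(https?)://[\w.-]+", text)
print(one_group)    # ['http', 'https']

# Two capture groups: each item is a tuple with one entry per group
two_groups = re.findall(r"(https?)://([\w.-]+)", text)
print(two_groups)   # [('http', 'tutorialspoint.com'), ('https', 'www.tutorialspoint.com')]
```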

Example

To extract a URL, we can use the following code −

import re
text = '<p>Hello World: </p><a href="http://tutorialspoint.com">More Courses</a><a href="https://www.tutorialspoint.com/market/index.asp">Even More Courses</a>'
# A raw string (r'...') passes the backslashes to the regex engine unchanged
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)
print("Original string: ", text)
print("Urls:", urls)

Output

Following is the output of the above program, when executed.

Original string:  <p>Hello World: </p><a href="http://tutorialspoint.com">More Courses</a><a href="https://www.tutorialspoint.com/market/index.asp">Even More Courses</a>
Urls: ['http://tutorialspoint.com', 'https://www.tutorialspoint.com/market/index.asp']
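The pattern above finds anything that looks like a URL, wherever it appears in the text. When only the targets of HTML links are wanted, one option is to capture the value of each href attribute instead. The following is a minimal sketch under the assumption that every href value is enclosed in double quotes; the pattern and the variable names are illustrative, and for real-world pages a dedicated HTML parser is usually more robust than a regular expression.

```python
import re

html = ('<p>Hello World: </p>'
        '<a href="http://tutorialspoint.com">More Courses</a>'
        '<a href="https://www.tutorialspoint.com/market/index.asp">Even More Courses</a>')

# Inside each <a ...> tag, capture the text between the double quotes of href
# (assumes double-quoted attribute values)
href_pattern = r'<a\s[^>]*href="([^"]*)"'
links = re.findall(href_pattern, html)
print(links)
# ['http://tutorialspoint.com', 'https://www.tutorialspoint.com/market/index.asp']
```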

Example

The program below demonstrates how to extract the host name and the protocol from a given URL.

import re
website = 'https://www.tutorialspoint.com/'
# To find the protocol
protocol = re.findall(r'(\w+)://', website)
print(protocol)
# To find the host name (without the leading "www.")
hostname = re.findall(r'://www\.([\w\-.]+)', website)
print(hostname)

Output

Following is the output of the above program, when executed.

['https']
['tutorialspoint.com']
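The host name pattern above only matches URLs that contain "www.". A possible variation, sketched below, makes that prefix optional with a non-capturing group so that one pattern handles both kinds of URL. The list of sample URLs and the names host_pattern and results are assumptions for illustration.

```python
import re

# Sample URLs with and without the "www." prefix (illustrative)
websites = [
    "https://www.tutorialspoint.com/",
    "http://tutorialspoint.com/courses",
    "https://docs.python.org/3/library/re.html",
]

# (?:www\.)? optionally skips a leading "www." without capturing it
host_pattern = re.compile(r"^(\w+)://(?:www\.)?([\w.-]+)")

results = []
for site in websites:
    m = host_pattern.match(site)
    if m:
        # group(1) is the protocol, group(2) is the host name
        results.append((m.group(1), m.group(2)))

print(results)
# [('https', 'tutorialspoint.com'), ('http', 'tutorialspoint.com'), ('https', 'docs.python.org')]
```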

Example

The following program uses capture groups to split a general URL into its protocol, host name, file name and file extension.

import re

# URL to split
url = 'http://www.tutorialspoint.com/index.html'

# Capture the protocol, host name, file name and extension
parts = re.findall(r'(\w+)://([\w\-.]+)/(\w+)\.(\w+)', url)
print(parts)

Output

Following is the output of the above program, when executed.

[('http', 'www.tutorialspoint.com', 'index', 'html')]
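A tuple of four unnamed strings can be hard to read later in a program. One alternative, sketched here, is to give each capture group a name with the `(?P<name>...)` syntax and read the result as a dictionary. The group names protocol, host, name and ext are chosen for this illustration.

```python
import re

# Same structure as the pattern above, but with named groups (names are illustrative)
url_pattern = re.compile(
    r"(?P<protocol>\w+)://(?P<host>[\w.-]+)/(?P<name>\w+)\.(?P<ext>\w+)"
)

m = url_pattern.match("http://www.tutorialspoint.com/index.html")

# groupdict() maps each group name to the text it matched
parts = m.groupdict() if m else {}
print(parts)
# {'protocol': 'http', 'host': 'www.tutorialspoint.com', 'name': 'index', 'ext': 'html'}
```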

Updated on: 04-Oct-2023
