How to use Python Regular expression to extract URL from an HTML link?
In this article, we will learn how to extract URLs from HTML links using Python regular expressions. A URL (Uniform Resource Locator) identifies the location of a resource on the Internet and is made up of components such as the protocol, domain name, port number, and path.
Python's re module provides powerful regular expression capabilities for parsing and extracting URLs from HTML content. Regular expressions allow us to define search patterns that can identify and extract specific URL formats from text.
Regular Expressions in Python
A regular expression is a search pattern used to find matching strings in text. Python's re module provides several key methods for working with regular expressions −
re.search() − Finds the first match of a pattern
re.match() − Matches pattern only at the beginning of a string
re.findall() − Returns all non-overlapping matches as a list
re.sub() − Replaces matched patterns with a new string
For URL extraction, we primarily use re.findall() to capture all URLs present in the HTML content.
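To make the difference between these four methods concrete, here is a minimal sketch that runs each one against the same string. The sample text and the simple pattern `https?://\S+` are assumptions chosen for illustration, not patterns taken from the examples in this article.

```python
import re

# Hypothetical sample text and a deliberately simple URL pattern
text = "Docs: https://example.com/a and http://example.org/b"
pattern = r"https?://\S+"

# re.search returns a match object for the first match anywhere in the string
first = re.search(pattern, text)
first_url = first.group(0) if first else None

# re.match only succeeds if the pattern matches at the very start of the string
at_start = re.match(pattern, text)

# re.findall returns every non-overlapping match as a list of strings
all_urls = re.findall(pattern, text)

# re.sub replaces each match with the replacement string
masked = re.sub(pattern, "<URL>", text)

print(first_url)   # first URL only
print(at_start)    # None, because the text starts with "Docs:"
print(all_urls)    # both URLs
print(masked)      # text with each URL replaced by <URL>
```

Because the text does not begin with a URL, re.match() finds nothing here, which is why re.search() or re.findall() is usually the better fit for scanning HTML.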
URL Structure
A typical URL has the following structure −
protocol://hostname:port/path?query#fragment
For example −
https://www.tutorialspoint.com/python/index.html
Protocol: https
Hostname: www.tutorialspoint.com
Path: /python/index.html
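The structure above can be mapped onto a regular expression with one named group per component. The following is a sketch only: the name `URL_PARTS` and the sample URL with a port, query, and fragment are assumptions for illustration, and the pattern does not attempt to cover every valid URL (for example, user credentials or IPv6 hosts).

```python
import re

# Hypothetical pattern: one named group per component of
# protocol://hostname:port/path?query#fragment
URL_PARTS = re.compile(
    r"""
    (?P<protocol>[A-Za-z][A-Za-z0-9+.-]*)://   # scheme before the separator
    (?P<hostname>[^/:?\#]+)                    # host, up to a colon, slash, ? or hash
    (?::(?P<port>\d+))?                        # optional :port
    (?P<path>/[^?\#]*)?                        # optional path
    (?:\?(?P<query>[^\#]*))?                   # optional ?query
    (?:\#(?P<fragment>.*))?                    # optional fragment after the hash
    """,
    re.VERBOSE,
)

# Hypothetical URL that exercises every component
url = "https://www.example.com:8080/docs/index.html?lang=en#intro"
match = URL_PARTS.fullmatch(url)
parts = match.groupdict() if match else {}

for name, value in parts.items():
    print(f"{name}: {value}")
```

Components that are absent from a URL come back as None, so the URL from the example above would yield a protocol, hostname, and path, with port, query, and fragment left unset.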
Extract All URLs From HTML String
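The following example extracts all http and https URLs from an HTML string. The pattern matches a run of URL characters after the scheme; because its character range $-_ also covers ', < and >, it relies on the URLs here being delimited by double quotes −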
import re
# HTML string containing multiple links
html_text = '''<p>Visit our website: </p>
<a href="https://www.tutorialspoint.com">TutorialsPoint</a>
<a href="http://example.com/page.html">Example</a>
<a href="https://github.com/user/repo">GitHub</a>'''
# Regex pattern to match URLs starting with http or https
url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
# Extract all URLs
urls = re.findall(url_pattern, html_text)
print("HTML Content:")
print(html_text)
print("\nExtracted URLs:")
for i, url in enumerate(urls, 1):
    print(f"{i}. {url}")
The output displays all extracted URLs from the HTML content −
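HTML Content:
<p>Visit our website: </p>
<a href="https://www.tutorialspoint.com">TutorialsPoint</a>
<a href="http://example.com/page.html">Example</a>
<a href="https://github.com/user/repo">GitHub</a>

Extracted URLs:
1. https://www.tutorialspoint.com
2. http://example.com/page.html
3. https://github.com/user/repo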
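One caveat about the pattern used above: inside the character class [$-_@.&+], the sequence $-_ is a range that runs from "$" to "_" and therefore includes ', <, > and /. The sketch below shows what happens when an href uses single quotes. The `strict` pattern is a hypothetical alternative offered for comparison, not part of the original example.

```python
import re

# Pattern from the example above; $-_ inside the class is a character range
loose = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# Hypothetical stricter alternative: stop at whitespace, quotes, and angle brackets
strict = r'''https?://[^\s"'<>]+'''

# Hypothetical link that uses single quotes around the href value
html = "<a href='https://example.com/docs'>Docs</a>"

loose_urls = re.findall(loose, html)
strict_urls = re.findall(strict, html)

print(loose_urls)   # runs past the closing quote into the tag text
print(strict_urls)  # stops at the closing quote
```

With double-quoted attributes the two patterns agree, since the double quote falls outside the $-_ range; the difference only shows up with single quotes or unquoted markup.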
Extract URLs from href Attributes
A more targeted approach is to extract URLs specifically from href attributes in anchor tags −
import re
# HTML content with anchor tags
html_content = '''
<div>
<a href="https://www.tutorialspoint.com/python">Python Tutorial</a>
<a href="https://www.tutorialspoint.com/java">Java Tutorial</a>
<a href="mailto:contact@example.com">Contact Us</a>
<a href="https://www.tutorialspoint.com/javascript">JavaScript</a>
</div>
'''
# Pattern to extract href attribute values (HTTP/HTTPS only)
# Triple quotes let the pattern contain both " and ' without escaping
href_pattern = r'''href=["']?(https?://[^"'>\s]+)'''
# Extract URLs from href attributes
urls = re.findall(href_pattern, html_content)
print("Extracted URLs from href attributes:")
for url in urls:
    print(f"- {url}")
The output shows only HTTP/HTTPS URLs from href attributes −
Extracted URLs from href attributes:
- https://www.tutorialspoint.com/python
- https://www.tutorialspoint.com/java
- https://www.tutorialspoint.com/javascript
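The href pattern above skips the mailto link because it only accepts values that start with http or https. A variation, sketched below under assumptions, is to capture every href value first and filter by scheme afterwards. The name `href_any` and the sample markup are hypothetical; the backreference \1 requires the closing quote to match the opening one.

```python
import re

# Hypothetical markup mixing double quotes, single quotes, and non-HTTP links
html_content = '''
<a href="https://www.tutorialspoint.com/python">Python</a>
<a href='http://example.com/single'>Single quotes</a>
<a href="/relative/page.html">Relative</a>
<a href="mailto:contact@example.com">Contact</a>
'''

# (["']) captures the opening quote; \1 demands the same quote at the end
href_any = re.compile(r'''href\s*=\s*(["'])(.*?)\1''', re.IGNORECASE)

# Group 2 holds the attribute value without its quotes
all_hrefs = [m.group(2) for m in href_any.finditer(html_content)]

# Keep only absolute HTTP/HTTPS URLs
web_urls = [h for h in all_hrefs if re.match(r'https?://', h)]

print(all_hrefs)
print(web_urls)
```

Separating capture from filtering keeps the regex short and makes it easy to treat relative paths or mailto links differently later on.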
Extract Protocol and Hostname
This example demonstrates how to extract specific components like protocol and hostname from URLs −
import re
# Sample URL
url = 'https://www.tutorialspoint.com/python/index.html'
# Extract protocol
protocol = re.findall(r'(\w+)://', url)
print(f"Protocol: {protocol[0] if protocol else 'Not found'}")
# Extract hostname (with www)
hostname_with_www = re.findall(r'://([^/]+)', url)
print(f"Full hostname: {hostname_with_www[0] if hostname_with_www else 'Not found'}")
# Extract domain name (without www)
domain = re.findall(r'://(?:www\.)?([^/]+)', url)
print(f"Domain: {domain[0] if domain else 'Not found'}")
# Extract path
path = re.findall(r'://[^/]+(/.*)', url)
print(f"Path: {path[0] if path else '/'}")
The output shows different URL components −
Protocol: https
Full hostname: www.tutorialspoint.com
Domain: tutorialspoint.com
Path: /python/index.html
Parse Complete URL Structure
This example uses grouped capturing to extract multiple URL components simultaneously −
import re
# Multiple URLs to parse
urls = [
'https://www.tutorialspoint.com/python/index.html',
'http://example.com/about.php',
'https://github.com/user/repository'
]
# Pattern with grouped capturing for protocol, domain, and path
pattern = r'(\w+)://([\w\-\.]+)(/[\w\-\./]*)?'
print("URL Component Analysis:")
print("-" * 50)
for url in urls:
    matches = re.findall(pattern, url)
    if matches:
        protocol, domain, path = matches[0]
        print(f"URL: {url}")
        print(f" Protocol: {protocol}")
        print(f" Domain: {domain}")
        print(f" Path: {path if path else '/'}")
        print()
The output provides a detailed breakdown of each URL −
URL Component Analysis:
--------------------------------------------------
URL: https://www.tutorialspoint.com/python/index.html
 Protocol: https
 Domain: www.tutorialspoint.com
 Path: /python/index.html

URL: http://example.com/about.php
 Protocol: http
 Domain: example.com
 Path: /about.php

URL: https://github.com/user/repository
 Protocol: https
 Domain: github.com
 Path: /user/repository
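One way to sanity-check a hand-written pattern like this is to compare its groups with the standard library's urllib.parse.urlsplit(). The sketch below does that for the three URLs above plus a fourth, hypothetical URL with a port and query string, which the grouped pattern was not written to handle. The `checks` dictionary is an illustrative name.

```python
import re
from urllib.parse import urlsplit

# Same grouped pattern as in the example above
pattern = re.compile(r'(\w+)://([\w\-\.]+)(/[\w\-\./]*)?')

urls = [
    'https://www.tutorialspoint.com/python/index.html',
    'http://example.com/about.php',
    'https://github.com/user/repository',
    'https://example.com:8080/search?q=regex',  # hypothetical: port and query
]

checks = {}
for url in urls:
    # groups(default='') turns an unmatched optional path into an empty string
    protocol, domain, path = pattern.match(url).groups(default='')
    parts = urlsplit(url)
    # Compare the regex groups with the parser's scheme, hostname, and path
    checks[url] = (protocol, domain, path) == (parts.scheme, parts.hostname, parts.path)

for url, agrees in checks.items():
    print(f"{url}: {'agrees' if agrees else 'differs'}")
```

The first three URLs agree with the parser, while the fourth differs because the pattern stops at the colon before the port and never reaches the path.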
Common Regex Patterns for URL Extraction
Here are commonly used regex patterns for different URL extraction scenarios −
| Pattern | Description |
|---|---|
| https?://[^\s]+ | Basic HTTP/HTTPS URLs |
| href=["']?(https?://[^"'>\s]+) | URLs from href attributes |
| (\w+)://([^/]+) | Protocol and hostname |
| https?://(?:www\.)?([^/]+) | Domain name (excluding www) |
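To see how the four patterns in the table behave on the same input, the sketch below applies each of them to one sample string. The sample text and the dictionary keys are assumptions chosen for illustration.

```python
import re

# Hypothetical sample: one link inside an anchor tag, one bare URL in text
sample = '<a href="https://www.example.com/docs">Docs</a> or http://example.org/faq'

# The four patterns from the table, keyed by illustrative names
patterns = {
    "basic": r'https?://[^\s]+',
    "href": r'''href=["']?(https?://[^"'>\s]+)''',
    "protocol_host": r'(\w+)://([^/]+)',
    "domain": r'https?://(?:www\.)?([^/]+)',
}

results = {name: re.findall(p, sample) for name, p in patterns.items()}

for name, found in results.items():
    print(f"{name}: {found}")
```

Note that the basic pattern only stops at whitespace, so inside HTML it also swallows the closing quote and the tag text that follow the first URL. It is better suited to plain text than to markup.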
Error Handling and Validation
When working with URL extraction, it's important to handle cases where no URLs are found −
import re
def extract_urls(text):
    """Extract URLs from text with error handling"""
    pattern = r'https?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    urls = re.findall(pattern, text)
    if not urls:
        return "No URLs found in the text"
    return urls
# Test with different inputs
test_cases = [
"Visit https://www.tutorialspoint.com for tutorials",
"No links here just plain text",
"Multiple links: https://example.com and http://test.org"
]
for i, text in enumerate(test_cases, 1):
    result = extract_urls(text)
    print(f"Test {i}: {result}")
The output demonstrates handling of different scenarios −
Test 1: ['https://www.tutorialspoint.com']
Test 2: No URLs found in the text
Test 3: ['https://example.com', 'http://test.org']
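The function above returns a list when URLs are found and a string when they are not, so callers have to check the type of the result. A possible variation, sketched here under assumptions, always returns a list, removes duplicates, and trims trailing sentence punctuation. The names `URL_RE` and `extract_unique_urls` and the simplified pattern are hypothetical, and the punctuation trimming would also cut a legitimate closing parenthesis at the end of a URL.

```python
import re

# Hypothetical simplified pattern: stop at whitespace, quotes, and angle brackets
URL_RE = re.compile(r'''https?://[^\s"'<>]+''')

def extract_unique_urls(text):
    """Return unique URLs in first-seen order; an empty list means no match."""
    seen = {}
    for url in URL_RE.findall(text):
        # Drop punctuation that usually belongs to the sentence, not the URL
        cleaned = url.rstrip('.,;:!?)')
        seen.setdefault(cleaned, None)  # dict keys preserve insertion order
    return list(seen)

print(extract_unique_urls("See https://example.com, then https://example.com."))
print(extract_unique_urls("No links here just plain text"))
```

Returning an empty list for the no-match case lets callers loop over the result or test it with `if not urls:` without a separate type check.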
Conclusion
Python's re module provides powerful tools for extracting URLs from HTML content using regular expressions. The key techniques include using re.findall() with appropriate patterns to match HTTP/HTTPS URLs, extracting specific components like protocol and hostname, and handling various URL formats found in HTML links.
