How to use Python regular expressions to extract URLs from HTML links?

In this article, we will learn how to extract URLs from HTML links using Python regular expressions. A URL (Uniform Resource Locator) identifies the location of a resource on the Internet, consisting of components like protocol, domain name, path, and port number.

Python's re module provides powerful regular expression capabilities for parsing and extracting URLs from HTML content. Regular expressions allow us to define search patterns that can identify and extract specific URL formats from text.

Regular Expressions in Python

A regular expression is a search pattern used to find matching strings in text. Python's re module provides several key methods for working with regular expressions −

  • re.search() − Finds the first match of a pattern

  • re.match() − Matches pattern only at the beginning of a string

  • re.findall() − Returns all non-overlapping matches as a list

  • re.sub() − Replaces matched patterns with a new string

For URL extraction, we primarily use re.findall() to capture all URLs present in the HTML content.
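The differences between these methods can be sketched with a short example (the sample text and URLs here are made up for illustration):

```python
import re

text = "Docs at https://example.com and https://example.org"

# re.search() returns the first match object (or None if nothing matches)
first = re.search(r'https?://\S+', text)
print(first.group())

# re.match() only matches at the start of the string, so it fails here
print(re.match(r'https?://\S+', text))

# re.findall() returns every non-overlapping match as a list
all_urls = re.findall(r'https?://\S+', text)
print(all_urls)

# re.sub() replaces each match with a new string
replaced = re.sub(r'https?://\S+', '[link]', text)
print(replaced)
```

Because the text does not begin with a URL, re.match() returns None while re.search() still finds the first link.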

URL Structure

A typical URL has the following structure −

protocol://hostname:port/path?query#fragment

For example −

https://www.tutorialspoint.com/python/index.html
Protocol: https
Hostname: www.tutorialspoint.com  
Path: /python/index.html
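As a cross-check, Python's standard urllib.parse module splits the same URL into these components without any regex:

```python
from urllib.parse import urlparse

# Parse the sample URL into its named components
parts = urlparse('https://www.tutorialspoint.com/python/index.html')

print(f"Protocol: {parts.scheme}")
print(f"Hostname: {parts.netloc}")
print(f"Path: {parts.path}")
```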

Extract All URLs From HTML String

The following example extracts all URLs from an HTML string using a comprehensive regex pattern −

import re

# HTML string containing multiple links
html_text = '''<p>Visit our website: </p>
<a href="https://www.tutorialspoint.com">TutorialsPoint</a>
<a href="http://example.com/page.html">Example</a>
<a href="https://github.com/user/repo">GitHub</a>'''

# Regex pattern to match URLs starting with http or https
url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# Extract all URLs
urls = re.findall(url_pattern, html_text)

print("HTML Content:")
print(html_text)
print("\nExtracted URLs:")
for i, url in enumerate(urls, 1):
    print(f"{i}. {url}")

The output displays all extracted URLs from the HTML content −

HTML Content:
<p>Visit our website: </p>
<a href="https://www.tutorialspoint.com">TutorialsPoint</a>
<a href="http://example.com/page.html">Example</a>
<a href="https://github.com/user/repo">GitHub</a>

Extracted URLs:
1. https://www.tutorialspoint.com
2. http://example.com/page.html
3. https://github.com/user/repo

Extract URLs from href Attributes

A more targeted approach is to extract URLs specifically from href attributes in anchor tags −

import re

# HTML content with anchor tags
html_content = '''
<div>
    <a href="https://www.tutorialspoint.com/python">Python Tutorial</a>
    <a href="https://www.tutorialspoint.com/java">Java Tutorial</a>
    <a href="mailto:contact@example.com">Contact Us</a>
    <a href="https://www.tutorialspoint.com/javascript">JavaScript</a>
</div>
'''

# Pattern to extract href attribute values (HTTP/HTTPS only)
href_pattern = r'''href=["']?(https?://[^"'>\s]+)'''

# Extract URLs from href attributes
urls = re.findall(href_pattern, html_content)

print("Extracted URLs from href attributes:")
for url in urls:
    print(f"• {url}")

The output shows only HTTP/HTTPS URLs from href attributes −

Extracted URLs from href attributes:
• https://www.tutorialspoint.com/python
• https://www.tutorialspoint.com/java
• https://www.tutorialspoint.com/javascript

Extract Protocol and Hostname

This example demonstrates how to extract specific components like protocol and hostname from URLs −

import re

# Sample URL
url = 'https://www.tutorialspoint.com/python/index.html'

# Extract protocol
protocol = re.findall(r'(\w+)://', url)
print(f"Protocol: {protocol[0] if protocol else 'Not found'}")

# Extract hostname (with www)
hostname_with_www = re.findall(r'://([^/]+)', url)
print(f"Full hostname: {hostname_with_www[0] if hostname_with_www else 'Not found'}")

# Extract domain name (without www)
domain = re.findall(r'://(?:www\.)?([^/]+)', url)
print(f"Domain: {domain[0] if domain else 'Not found'}")

# Extract path
path = re.findall(r'://[^/]+(/.*)', url)
print(f"Path: {path[0] if path else '/'}")

The output shows different URL components −

Protocol: https
Full hostname: www.tutorialspoint.com
Domain: www.tutorialspoint.com
Path: /python/index.html
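The same re.findall() approach extends to the optional port, query, and fragment components from the URL structure shown earlier (the sample URL below is made up for illustration):

```python
import re

# Hypothetical URL containing a port, query string, and fragment
url = 'https://example.com:8080/search?q=python#results'

# Port: digits between ':' and the following '/'
port = re.findall(r':(\d+)/', url)
print(f"Port: {port[0] if port else 'default'}")

# Query string: everything between '?' and '#' (or the end)
query = re.findall(r'\?([^#]+)', url)
print(f"Query: {query[0] if query else 'None'}")

# Fragment: everything after '#'
fragment = re.findall(r'#(.+)$', url)
print(f"Fragment: {fragment[0] if fragment else 'None'}")
```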

Parse Complete URL Structure

This example uses grouped capturing to extract multiple URL components simultaneously −

import re

# Multiple URLs to parse
urls = [
    'https://www.tutorialspoint.com/python/index.html',
    'http://example.com/about.php',
    'https://github.com/user/repository'
]

# Pattern with grouped capturing for protocol, domain, and path
pattern = r'(\w+)://([\w\-\.]+)(/[\w\-\./]*)?'

print("URL Component Analysis:")
print("-" * 50)

for url in urls:
    matches = re.findall(pattern, url)
    if matches:
        protocol, domain, path = matches[0]
        print(f"URL: {url}")
        print(f"  Protocol: {protocol}")
        print(f"  Domain: {domain}")
        print(f"  Path: {path if path else '/'}")
        print()

The output provides detailed breakdown of each URL −

URL Component Analysis:
--------------------------------------------------
URL: https://www.tutorialspoint.com/python/index.html
  Protocol: https
  Domain: www.tutorialspoint.com
  Path: /python/index.html

URL: http://example.com/about.php
  Protocol: http
  Domain: example.com
  Path: /about.php

URL: https://github.com/user/repository
  Protocol: https
  Domain: github.com
  Path: /user/repository

Common Regex Patterns for URL Extraction

Here are commonly used regex patterns for different URL extraction scenarios −

Pattern                                  Description
https?://[^\s]+                          Basic HTTP/HTTPS URLs
href=["']?(https?://[^"'>\s]+)           URLs from href attributes
(\w+)://([^/]+)                          Protocol and hostname
https?://(?:www\.)?([^/]+)               Domain name (excluding www)
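A quick sanity check of these patterns against one sample string shows how they differ − note how the basic pattern greedily over-matches into the surrounding markup, while the href pattern stops at the closing quote:

```python
import re

sample = 'See <a href="https://www.tutorialspoint.com/python">docs</a>'

# Basic pattern: greedy, so it runs past the closing quote into the markup
basic = re.findall(r'https?://[^\s]+', sample)
print(basic)

# href pattern: stops at the quote, capturing only the URL itself
href = re.findall(r'''href=["']?(https?://[^"'>\s]+)''', sample)
print(href)

# Protocol and hostname captured as separate groups
proto_host = re.findall(r'(\w+)://([^/]+)', sample)
print(proto_host)

# Domain with the leading 'www.' stripped
domain = re.findall(r'https?://(?:www\.)?([^/]+)', sample)
print(domain)
```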

Error Handling and Validation

When working with URL extraction, it's important to handle cases where no URLs are found −

import re

def extract_urls(text):
    """Extract URLs from text with error handling"""
    pattern = r'https?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    urls = re.findall(pattern, text)
    
    if not urls:
        return "No URLs found in the text"
    return urls

# Test with different inputs
test_cases = [
    "Visit https://www.tutorialspoint.com for tutorials",
    "No links here just plain text",
    "Multiple links: https://example.com and http://test.org"
]

for i, text in enumerate(test_cases, 1):
    result = extract_urls(text)
    print(f"Test {i}: {result}")

The output demonstrates handling of different scenarios −

Test 1: ['https://www.tutorialspoint.com']
Test 2: No URLs found in the text
Test 3: ['https://example.com', 'http://test.org']

Conclusion

Python's re module provides powerful tools for extracting URLs from HTML content using regular expressions. The key techniques include using re.findall() with appropriate patterns to match HTTP/HTTPS URLs, extracting specific components like protocol and hostname, and handling various URL formats found in HTML links.

Updated on: 2026-03-16T21:38:53+05:30
