Extract URLs present in a given string


In the information age, it's common to encounter strings of text that contain URLs. As part of data cleaning or web scraping tasks, we often need to extract these URLs for further processing. In this article, we'll explore how to do this using C++, a high-performance language that offers fine-grained control over system resources.

Understanding URLs

A URL (Uniform Resource Locator) is a reference to a web resource that specifies its location on a computer network and a mechanism for retrieving it. In layman's terms, a URL is a web address.
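
To make the structure of a URL concrete, here is a minimal sketch that splits a simple http(s) URL into its scheme, host, and path. The names `UrlParts` and `splitUrl` are hypothetical helpers introduced only for this illustration, and the pattern deliberately ignores ports, query strings, and other parts of the full URL syntax.

```cpp
#include <regex>
#include <string>

// Hypothetical helper type for this illustration only.
struct UrlParts {
    std::string scheme;  // e.g. "https"
    std::string host;    // e.g. "www.tutorialspoint.com"
    std::string path;    // everything from the first '/' after the host; may be empty
};

// Splits a simple http(s) URL into scheme, host, and path.
// Returns false if the whole input does not look like such a URL.
bool splitUrl(const std::string& url, UrlParts& out) {
    // Group 1: scheme, group 2: host (no '/' or whitespace), group 3: optional path.
    static const std::regex pattern(R"((https?)://([^/\s]+)(/\S*)?)");
    std::smatch m;
    if (!std::regex_match(url, m, pattern)) {
        return false;
    }
    out.scheme = m[1].str();
    out.host = m[2].str();
    out.path = m[3].str();  // an unmatched optional group yields an empty string
    return true;
}
```

For example, calling `splitUrl` on `https://www.tutorialspoint.com/cplusplus/index.htm` (an illustrative address) would fill in the scheme `https`, the host `www.tutorialspoint.com`, and the path `/cplusplus/index.htm`.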

Problem Statement

Given a string that contains several URLs, our task is to extract all the URLs present in the string.

Solution Approach

To solve this problem, we'll use the regular expression (regex) support that C++ provides in the <regex> header, available since C++11. A regular expression is a sequence of characters that defines a search pattern, used mainly for matching patterns in strings.

The steps involved in our approach are −

Define a Regex Pattern: Define a regex pattern that matches the general structure of a URL.

Match and Extract: Use the regex pattern to match and extract all URLs present in the given string.
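
The two steps above can be sketched in their simplest form as a function that returns only the first URL in a text. This is a rough illustration rather than a complete solution: `findFirstURL` is a hypothetical name, and the pattern is a simplified assumption that treats any run of non-whitespace characters after `http://` or `https://` as part of the URL.

```cpp
#include <regex>
#include <string>

// Returns the first http(s) URL found in `text`, or an empty string if none.
// `findFirstURL` is a hypothetical name used for this sketch.
std::string findFirstURL(const std::string& text) {
    // Step 1: define the pattern. A raw string literal R"(...)" keeps
    // backslashes such as \S intact, with no extra escaping for C++.
    static const std::regex urlPattern(R"(https?://\S+)");

    // Step 2: search the text and extract the matched substring.
    std::smatch match;
    if (std::regex_search(text, match, urlPattern)) {
        return match.str();
    }
    return "";
}
```

Note that `std::regex_search` stops at the first match. To collect every URL in the string, we iterate over all matches with `std::sregex_iterator` instead.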

C++ Implementation

Example

Here's the C++ code that implements our solution −

#include <iostream>
#include <regex>
#include <string>
#include <vector>
using namespace std;

// Function to extract all URLs from a string
vector<string> extractURLs(const string& str) {
   vector<string> urls;
   // A raw string literal keeps the regex backslashes intact.
   // Matches http:// or https://, a host name, a dot, a top-level
   // domain of 2 to 6 letters, and an optional path without spaces.
   regex urlPattern(R"(https?://[\da-z.-]+\.[a-z]{2,6}(?:/[\w./-]*)?)");
   
   auto words_begin = sregex_iterator(str.begin(), str.end(), urlPattern);
   auto words_end = sregex_iterator();
   
   // Each iterator position refers to one match; store its text
   for (sregex_iterator i = words_begin; i != words_end; ++i) {
      urls.push_back(i->str());
   }
   
   return urls;
}

int main() {
   string str = "Visit https://www.tutorialspoint.com and http://www.tutorix.com for more information.";
   
   vector<string> urls = extractURLs(str);
   cout << "URLs found in the string:" << endl;
   for (const string& url : urls)
      cout << url << endl;
   
   return 0;
}

Output

URLs found in the string:
https://www.tutorialspoint.com
http://www.tutorix.com

Explanation

Let's consider the string −

str = "Visit https://www.tutorialspoint.com and http://www.tutorix.com for more information."

After applying our function to this string, it matches the two URLs and extracts them into a vector:

urls = ["https://www.tutorialspoint.com", "http://www.tutorix.com"]

This vector is the output of our program.
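
In running text, a URL is often followed directly by sentence punctuation, and a pattern that allows characters such as '.' in the path may capture that punctuation as part of the match. As a hedged sketch of one possible clean-up step, the hypothetical helper `trimTrailingPunctuation` below strips such characters from the end of an extracted URL. It is a heuristic: a URL that legitimately ends in one of these characters would be shortened too.

```cpp
#include <string>

// Removes trailing sentence punctuation (. , ; : ! ?) from an extracted URL.
// `trimTrailingPunctuation` is a hypothetical helper for this sketch.
std::string trimTrailingPunctuation(std::string url) {
    const std::string punctuation = ".,;:!?";
    // Drop characters from the end while they are in the punctuation set.
    while (!url.empty() && punctuation.find(url.back()) != std::string::npos) {
        url.pop_back();
    }
    return url;
}
```

This could be applied to each element of the vector after extraction, leaving URLs without trailing punctuation unchanged.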

Conclusion

Extracting URLs from a string is a compact exercise in text processing with regular expressions: we define a pattern that describes a URL, then iterate over every match with sregex_iterator. The same approach is useful in data analysis, web scraping, and general software development whenever links have to be pulled out of free-form text.

Updated on: 17-May-2023
