Extract the title from a webpage using Python


In Python, we can extract the title of a webpage using web scraping, the process of extracting data from a website or webpage. In this article, we will scrape the title of a webpage using the Requests and BeautifulSoup libraries, and then look at alternatives based on urllib, Selenium, and regular expressions.

Extracting Title from Webpage

Method 1: Using the Requests and Beautiful Soup libraries

We can use the Requests and Beautiful Soup libraries to extract the title from a webpage. The requests library sends an HTTP request to a website and returns a response object, from which we read the HTML content of the webpage. Beautiful Soup then parses that HTML so that the title tag can be looked up.

Example

In the example below, we extract the title of the Wikipedia homepage. We send a GET request to the Wikipedia URL using the requests library and store the response object in the response variable.

We then create a BeautifulSoup object to parse the HTML content of the response and access the title tag of the webpage through the soup.title attribute. Finally, we read the tag's string attribute, which holds its text, and store it in the title variable.

import requests
from bs4 import BeautifulSoup

url = 'https://www.wikipedia.org/'
response = requests.get(url)

soup = BeautifulSoup(response.content, 'html.parser')
title = soup.title.string

print(title)

Output

Wikipedia
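
The example above assumes that the page contains a non-empty title tag. When the tag is missing, soup.title is None and reading its string attribute raises an AttributeError. The following is a minimal sketch of a more defensive helper, assuming the bs4 package is installed; the function name extract_title is hypothetical and is not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup


def extract_title(html):
    """Return the page title as a string, or None if there is no usable title.

    `extract_title` is a hypothetical helper name used for illustration.
    """
    soup = BeautifulSoup(html, 'html.parser')
    # soup.title is None when the document has no <title> tag.
    if soup.title is None:
        return None
    # get_text(strip=True) returns the tag's text without surrounding whitespace.
    text = soup.title.get_text(strip=True)
    # Treat an empty title the same as a missing one.
    return text or None
```

It can be called with response.content or response.text from the example above, for instance extract_title(response.text).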

Method 2: Extracting title using urllib and BeautifulSoup

In this method, the urllib module from Python's standard library opens the URL and retrieves the HTML content of the webpage, so the requests library is not needed. A BeautifulSoup object is created from the HTML content, and the title tag of the webpage is extracted using the 'soup.title' attribute.

Example

In the example below, we use urlopen from the urllib library to open the URL and retrieve the HTML content of the webpage. We then pass the returned response directly to BeautifulSoup, together with the 'html.parser' parser.

We extract the title tag of the webpage using the 'soup.title' attribute, read its text using the 'string' attribute, and store it in the 'title' variable. Finally, we print the title of the webpage to the console.

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.wikipedia.org/'
html_page = urlopen(url)
soup = BeautifulSoup(html_page, 'html.parser')
title = soup.title.string

print(title)

Output

Wikipedia
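
Some servers may reject requests that carry urllib's default User-Agent header, and not every page is encoded as UTF-8. The following is a rough sketch of a fetch step that sets a header, applies a timeout, and decodes the body using the charset declared by the server. The function name fetch_html, the User-Agent value, and the 10-second default are assumptions chosen for illustration.

```python
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Return the body at `url` decoded to text (hypothetical helper)."""
    # The User-Agent string below is an arbitrary example value.
    request = Request(url, headers={'User-Agent': 'title-extractor-example/0.1'})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset from the Content-Type header, falling back to UTF-8.
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset, errors='replace')
```

The returned string can be passed to BeautifulSoup(html, 'html.parser') in the same way as html_page in the example above.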

Method 3: Extracting title using selenium and BeautifulSoup

In this method, the selenium library drives a real browser to open the URL. A Chrome webdriver is created and used to navigate to the webpage, and the HTML content of the loaded page is retrieved using the 'page_source' attribute of the webdriver. Because the browser executes JavaScript, this approach can also help with pages whose title is set by a script. A BeautifulSoup object is created from the HTML content, and the title tag of the webpage is extracted using the 'soup.title' attribute.

Example

In the example below, we create a Chrome webdriver, use it to navigate to the webpage, and retrieve the HTML content of the webpage using the 'page_source' attribute of the webdriver.

We create a BeautifulSoup object from the HTML content using the 'html.parser' parser and extract the title tag using the 'soup.title' attribute. We then read the text of the title tag using the 'string' attribute, store it in the 'title' variable, and print it to the console. Finally, we call driver.quit() to close the browser.

from selenium import webdriver
from bs4 import BeautifulSoup

url = 'https://www.wikipedia.org/'
driver = webdriver.Chrome()
driver.get(url)

html_page = driver.page_source
soup = BeautifulSoup(html_page, 'html.parser')
title = soup.title.string

print(title)

driver.quit()

Output

Wikipedia

Method 4: Extracting title using regular expressions

In this method, no HTML parser is used. We send a GET request to the URL using the requests library, decode the HTML content of the response into a string, and search that string with a regular expression that matches the title tag. The 'search' method of the compiled pattern finds the first match in the HTML content, and 'group(1)' returns the text captured between the opening and closing title tags. Regular expressions are less robust than a parser for arbitrary HTML, so this approach is best suited to simple pages.

Example

In the example below, we use regular expressions to extract the title of the webpage. We send a GET request to the URL using the requests library and store the response object in the 'response' variable.

We then decode the HTML content of the webpage using the 'utf-8' encoding, which assumes the page is served as UTF-8, and store it in the 'html_content' variable. We define a regular expression pattern that matches the title tag of the webpage and captures the text inside it.

We use the 'search' method of the pattern to find the first match in the HTML content of the webpage. We extract the captured text using the 'group(1)' method and store it in the 'title' variable. We then print the title of the webpage to the console.

import re
import requests

url = 'https://www.wikipedia.org/'
response = requests.get(url)
html_content = response.content.decode('utf-8')

# Capture the text between the opening and closing title tags
title_pattern = re.compile('<title>(.+?)</title>')
match = title_pattern.search(html_content)
title = match.group(1)

print(title)

Output

Wikipedia
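
A pattern that matches the title tag literally can miss titles written in upper case, titles that span several lines, or tags that carry attributes. The following is a minimal sketch of a more tolerant pattern, using only the standard library. The names TITLE_RE and title_from_html are hypothetical, and the whitespace and entity handling are design choices made for this illustration.

```python
import html
import re

# Ignore case, allow attributes on the tag, and let '.' match newlines.
TITLE_RE = re.compile(r'<title\b[^>]*>(.*?)</title\s*>', re.IGNORECASE | re.DOTALL)


def title_from_html(markup):
    """Return the title text found in `markup`, or None if no title tag matches."""
    match = TITLE_RE.search(markup)
    if match is None:
        return None
    # Collapse runs of whitespace (including newlines) into single spaces.
    text = ' '.join(match.group(1).split())
    # Convert character references such as &amp; back to plain characters.
    return html.unescape(text)
```

For example, title_from_html(html_content) can replace the search and group(1) steps shown above, and it returns None instead of raising an error when no title tag is found.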

Conclusion

In this article, we discussed four ways to extract the title from a webpage in Python. The requests and urllib libraries send an HTTP request to the website URL and return the HTML content as a response, while Selenium loads the page in a real browser. In the first three methods, the Beautiful Soup library parses the HTML content and exposes the title through the soup.title attribute. In the fourth method, a regular expression extracts the title directly from the HTML string.

Rohan Singh

Updated on: 2023-07-10T13:37:58+05:30
