Extract the title from a webpage using Python
In Python, we can extract the title from a webpage using web scraping. Web scraping is the process of extracting data from a website or webpage. In this article, we will scrape the title of a webpage using several Python tools: Requests, BeautifulSoup, urllib, Selenium, and regular expressions (the re module).
Method 1: Using Requests and BeautifulSoup
The most common approach uses the requests library to send HTTP requests and BeautifulSoup to parse HTML content. The requests library fetches the webpage, and BeautifulSoup extracts the title tag.
Example
In the example below, we extract the title of the Wikipedia homepage. We send a GET request to the URL and parse the HTML response:
import requests
from bs4 import BeautifulSoup

url = 'https://www.wikipedia.org/'
# Fetch the page and parse the returned HTML
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
# Read the text of the <title> tag
title = soup.title.string
print(title)
Output
Wikipedia
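The example above assumes the page always has a title tag. As a minimal sketch of a more defensive variant, the hypothetical helper `title_from_html` below (the name is not part of any library) returns None when no usable title is found. It uses `get_text()` rather than `.string`, because `.string` is None whenever the tag ends up with more than one child node.

```python
from typing import Optional

from bs4 import BeautifulSoup


def title_from_html(html: str) -> Optional[str]:
    """Return the page title, or None when there is no usable <title>."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.title is None:
        # The document has no <title> tag at all.
        return None
    # get_text() joins all text inside the tag and strips whitespace.
    text = soup.title.get_text(strip=True)
    return text or None


print(title_from_html("<html><head><title> Example Page </title></head></html>"))
print(title_from_html("<html><body><p>No title here</p></body></html>"))
```

With a live page, the same helper could be called as `title_from_html(requests.get(url, timeout=10).text)`; the 10-second timeout is an arbitrary choice for illustration.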
Method 2: Using urllib and BeautifulSoup
This method uses urllib (built into Python) instead of requests. The urllib library opens the URL directly and retrieves the HTML content, which is then parsed by BeautifulSoup.
Example
Here we use urllib.request.urlopen() to fetch the webpage content:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.wikipedia.org/'
# urlopen() returns a file-like response that BeautifulSoup can read directly
html_page = urlopen(url)
soup = BeautifulSoup(html_page, 'html.parser')
title = soup.title.string
print(title)
Output
Wikipedia
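If the goal is to avoid third-party packages entirely, the parsing step can also be done with the standard library's `html.parser` module. This swaps BeautifulSoup for a different technique, so treat it as a rough sketch: `TitleParser`, `title_from_html`, and `fetch_title` are hypothetical names, and decoding the page as UTF-8 is an assumption that will not hold for every site.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; ignore any title after the first.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect all of them.
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self) -> Optional[str]:
        text = "".join(self._parts).strip()
        return text or None


def title_from_html(html: str) -> Optional[str]:
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


def fetch_title(url: str, timeout: float = 10.0) -> Optional[str]:
    # Assumption: the page is UTF-8; undecodable bytes are replaced.
    with urlopen(url, timeout=timeout) as response:
        html = response.read().decode("utf-8", errors="replace")
    return title_from_html(html)


print(title_from_html("<html><head><title>Example Page</title></head></html>"))
```

Calling `fetch_title('https://www.wikipedia.org/')` would run the full download-and-parse path using only the standard library.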
Method 3: Using Selenium and BeautifulSoup
Selenium is useful for JavaScript-heavy websites where the title might be dynamically generated. It opens a real browser, loads the page completely, then extracts the HTML source.
Example
This approach uses Chrome WebDriver to load the page and get the rendered HTML:
from selenium import webdriver
from bs4 import BeautifulSoup

url = 'https://www.wikipedia.org/'
# Launch Chrome and load the page so that scripts can run
driver = webdriver.Chrome()
driver.get(url)
# page_source holds the HTML after the browser has rendered it
html_page = driver.page_source
soup = BeautifulSoup(html_page, 'html.parser')
title = soup.title.string
print(title)
driver.quit()
Output
Wikipedia
Method 4: Using Regular Expressions
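Regular expressions can extract the title directly from HTML text without parsing the entire document. Skipping the parse step can make this method faster, but it is less reliable: a pattern can miss a title whose tag carries attributes, uses different letter case, or spans several lines.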
Example
We use a regex pattern to match the title tags in the HTML content:
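import re
import requests

url = 'https://www.wikipedia.org/'
response = requests.get(url)
html_content = response.content.decode('utf-8')
# IGNORECASE accepts <TITLE>; DOTALL lets the title span line breaks
title_pattern = re.compile(r'<title>(.+?)</title>', re.IGNORECASE | re.DOTALL)
match = title_pattern.search(html_content)
# search() returns None when no title tag matches, so guard before group()
title = match.group(1).strip() if match else None
print(title)
Output
Wikipedia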
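To illustrate where regex matching tends to break, the sketch below compares a strict pattern with a more tolerant one on a few made-up HTML snippets. The names `STRICT`, `TOLERANT`, and `extract_title` are hypothetical, and even the tolerant pattern is only an approximation of what an HTML parser does.

```python
import html
import re
from typing import Optional

# Strict pattern: needs a bare, lowercase <title> tag on a single line.
STRICT = re.compile(r"<title>(.+?)</title>")

# Tolerant pattern: allows attributes, any letter case, and line breaks.
TOLERANT = re.compile(r"<title\b[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)


def extract_title(pattern: re.Pattern, page: str) -> Optional[str]:
    match = pattern.search(page)
    if match is None:
        return None
    # Decode entities such as &amp; and collapse runs of whitespace.
    text = " ".join(html.unescape(match.group(1)).split())
    return text or None


# Hypothetical inputs, one per failure mode.
samples = [
    "<title>Plain</title>",
    "<TITLE>Upper case</TITLE>",
    '<title id="t">With attribute</title>',
    "<title>\n  Spans lines\n</title>",
]

for sample in samples:
    print(extract_title(STRICT, sample), "|", extract_title(TOLERANT, sample))
```

The strict pattern finds a title only in the first sample, while the tolerant one handles all four. Inputs such as a title-like string inside a comment or script would still fool both, which is why a parser is usually the safer choice.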
Comparison of Methods
| Method | Best For | Dependencies | JavaScript Support |
|---|---|---|---|
| Requests + BeautifulSoup | Static websites | requests, beautifulsoup4 | No |
| urllib + BeautifulSoup | Avoiding the requests dependency | beautifulsoup4 only | No |
| Selenium + BeautifulSoup | JavaScript-heavy sites | selenium, beautifulsoup4, browser driver | Yes |
| Regular Expressions | Simple HTML, speed | requests only | No |
Conclusion
Use Requests + BeautifulSoup for most static websites as it's reliable and efficient. Choose Selenium when dealing with JavaScript-rendered content, and use regular expressions only for simple HTML structures where performance is critical.
