How can BeautifulSoup be used to extract ‘href’ links from a website?


BeautifulSoup is a third-party Python library for parsing HTML and XML documents. It is widely used in web scraping, the process of extracting and processing data from web pages.

Web scraping can also be used to gather data for research, track and compare market trends, perform SEO monitoring, and so on.

Run the command below to install BeautifulSoup. The example that follows also uses the requests library to fetch the page, so it is installed here as well −

pip install beautifulsoup4 requests
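
After installing, you can confirm that the package is importable from a Python session; printing the version attribute is a quick sanity check −

```python
# Quick check that the install worked: import the package and
# print the installed version string.
import bs4
print(bs4.__version__)
```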

Following is an example −

Example

from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/Algorithm"

# Fetch the page and parse its HTML into a navigable tree
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")

print("The href links are :")
# find_all('a') returns every anchor tag; get('href') reads its link target
for link in soup.find_all('a'):
    print(link.get('href'))

Output

The href links are :
…
https://stats.wikimedia.org/#/en.wikipedia.org
https://foundation.wikimedia.org/wiki/Cookie_statement
https://wikimediafoundation.org/
https://www.mediawiki.org/

Explanation

  • The required packages are imported.

  • The URL of the web page is defined.

  • The ‘requests.get’ function fetches the page, and its HTML text is read.

  • The ‘BeautifulSoup’ constructor parses the HTML into a searchable tree.

  • The ‘find_all’ function collects every anchor (‘a’) tag on the page.

  • The ‘href’ attribute of each anchor is printed on the console.
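
As a variant on the steps above, ‘find_all’ can be told to skip anchors that have no ‘href’ attribute, and ‘urljoin’ from the standard library can resolve relative paths into absolute URLs. The sketch below demonstrates this on a small inline HTML sample (hypothetical markup standing in for a fetched page) rather than a live request −

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Sample HTML standing in for a fetched page (hypothetical markup)
html = """
<a href="/wiki/Algorithm">Algorithm</a>
<a name="anchor-only">No href here</a>
<a href="https://www.mediawiki.org/">MediaWiki</a>
"""

base_url = "https://en.wikipedia.org"
soup = BeautifulSoup(html, "html.parser")

# href=True skips <a> tags without an href attribute, and
# urljoin resolves relative paths against the base URL.
links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
print(links)
# → ['https://en.wikipedia.org/wiki/Algorithm', 'https://www.mediawiki.org/']
```

Resolving links this way is useful when a crawler needs absolute URLs it can fetch directly.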

Updated on: 18-Jan-2021
