How can BeautifulSoup package be used to parse data from a webpage in Python?

PythonServer Side ProgrammingProgramming

BeautifulSoup is a third party Python library that is used to parse data from web pages. It helps in web scraping, which is a process of extracting, using, and manipulating the data from different resources.

Web scraping can also be used to extract data for research purposes, understand/compare market trends, perform SEO monitoring, and so on.

The below line can be run to install BeautifulSoup on Windows −

pip install beautifulsoup4

Let us see an example −

Example

import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen
import urllib
url = 'https://en.wikipedia.org/wiki/Algorithm'
html = urlopen(url).read()
print("Reading the webpage...")
soup = BeautifulSoup(html, features="html.parser")
print("Parsing the webpage...")
for script in soup(["script", "style"]):
   script.extract() # rip it out
print("Extracting text from the webpage...")
text = soup.get_text()
print("Data cleaning...")
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
text = '\n'.join(chunk for chunk in chunks if chunk)
text = str(text)
print(text)

Output

Reading the webpage...
Parsing the webpage...
Extracting text from the webpage...
Data cleaning...
Recursive C implementation of Euclid's algorithm from the above flowchart
Recursion
A recursive algorithm is one that invokes (makes reference to) itself repeatedly until a certain condition (also known as termination condition) matches, which is a method common to functional programming….
…..
Developers
Statistics
Cookie statement

Explanation

  • The required packages are imported, and aliased.

  • The website is defined.

  • The url is opened, and the ‘script’ tag and other irrelevant HTML tags are removed.

  • The ‘get_text’ function is used to extract text from the webpage data.

  • The extra spaces and invalid words are eliminated.

  • The text is printed on the console.

raja
Published on 18-Jan-2021 17:22:22
Advertisements