Beautiful Soup - Installation

Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.

The Beautiful Soup package is not part of Python's standard library, so it must be installed separately. Before installing the latest version, let us create a virtual environment, as Python recommends.

A virtual environment gives a project its own isolated copy of the Python interpreter and its packages, so installing into it does not affect the system-wide setup.

We shall use the venv module from Python's standard library to create the virtual environment. pip is included by default in Python 3.4 or later.

Use the following command to create a virtual environment on Windows

D:\beautiful_soup>py -m venv myenv

On Ubuntu Linux, update the APT package index and install venv, if required, before creating the virtual environment

mvl@GNVBGL3:~ $ sudo apt update && sudo apt upgrade -y
mvl@GNVBGL3:~ $ sudo apt install python3-venv

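Then use the following command to create a virtual environment. Run it as your normal user rather than with sudo; otherwise the environment is owned by root and later installs into it fail with permission errors.

mvl@GNVBGL3:~ $ python3 -m venv myenv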
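
The same environment can also be created from Python itself, since the command above is a thin wrapper around the standard library's venv module. The following is a minimal sketch; the helper name create_env is an assumption made for illustration, and the demo uses a throwaway temporary directory with pip left out to keep it quick.

```python
import tempfile
import venv
from pathlib import Path


def create_env(env_dir, with_pip=True):
    """Create a virtual environment at env_dir and return its path.

    create_env is a hypothetical helper name; venv.create is the
    standard-library call behind `python -m venv`.
    """
    env_path = Path(env_dir)
    venv.create(env_path, with_pip=with_pip)
    return env_path


if __name__ == "__main__":
    # Demo in a temporary directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        env = create_env(Path(tmp) / "myenv", with_pip=False)
        # Every virtual environment contains a pyvenv.cfg marker file.
        print("created:", (env / "pyvenv.cfg").is_file())
```

For a real project, pass a permanent path such as "myenv" and keep with_pip=True so that pip is available inside the environment.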

You need to activate the virtual environment. On Windows, use the commands

D:\beautiful_soup>cd myenv
D:\beautiful_soup\myenv>scripts\activate
(myenv) D:\beautiful_soup\myenv>

On Ubuntu Linux, use the following commands to activate the virtual environment

mvl@GNVBGL3:~$ cd myenv
mvl@GNVBGL3:~/myenv$ source bin/activate
(myenv) mvl@GNVBGL3:~/myenv$
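
Activation can also be confirmed from inside Python. A small sketch, assuming Python 3.3 or later: in an environment created by venv, sys.prefix points at the environment while sys.base_prefix still points at the base installation. The function name in_virtualenv is chosen here for illustration.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a venv environment."""
    # Outside a virtual environment the two prefixes are identical.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter prefix:", sys.prefix)
```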

The name of the virtual environment appears in parentheses at the start of the prompt. Now that it is activated, we can install Beautiful Soup in it.

D:\beautiful_soup\myenv\pythonProject> pip3 install beautifulsoup4
     
Collecting beautifulsoup4
  Obtaining dependency information for beautifulsoup4 from https://files.pythonhosted.org/packages/1a/39/47f9197bdd44df24d67ac8893641e16f386c984a0619ef2ee4c51fbbc019/beautifulsoup4-4.14.3-py3-none-any.whl.metadata
  Downloading beautifulsoup4-4.14.3-py3-none-any.whl.metadata (3.8 kB)
Collecting soupsieve>=1.6.1 (from beautifulsoup4)
  Obtaining dependency information for soupsieve>=1.6.1 from https://files.pythonhosted.org/packages/14/a0/bb38d3b76b8cae341dad93a2dd83ab7462e6dbcdd84d43f54ee60a8dc167/soupsieve-2.8-py3-none-any.whl.metadata
  Downloading soupsieve-2.8-py3-none-any.whl.metadata (4.6 kB)
Collecting typing-extensions>=4.0.0 (from beautifulsoup4)
  Obtaining dependency information for typing-extensions>=4.0.0 from https://files.pythonhosted.org/packages/18/67/36e9267722cc04a6b9f15c7f3441c2363321a3ea07da7ae0c0707beb2a9c/typing_extensions-4.15.0-py3-none-any.whl.metadata
  Downloading typing_extensions-4.15.0-py3-none-any.whl.metadata (3.3 kB)
Downloading beautifulsoup4-4.14.3-py3-none-any.whl (107 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 107.7/107.7 kB 3.1 MB/s eta 0:00:00
Downloading soupsieve-2.8-py3-none-any.whl (36 kB)
Downloading typing_extensions-4.15.0-py3-none-any.whl (44 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.6/44.6 kB ? eta 0:00:00
Installing collected packages: typing-extensions, soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.14.3 soupsieve-2.8 typing-extensions-4.15.0
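
When the installation is driven from a script, it helps to call pip through the interpreter that is currently running, so the package lands in the active environment and not in some other Python on the PATH. The following is a rough sketch of that idea; the helper names pip_install_command and pip_install are assumptions made for illustration.

```python
import subprocess
import sys


def pip_install_command(*packages):
    """Build a pip command bound to the interpreter that is currently running."""
    return [sys.executable, "-m", "pip", "install", *packages]


def pip_install(*packages):
    """Run pip for the given packages; raises CalledProcessError on failure."""
    subprocess.run(pip_install_command(*packages), check=True)


if __name__ == "__main__":
    # Only print the command here; call pip_install("beautifulsoup4") to run it.
    print(" ".join(pip_install_command("beautifulsoup4")))
```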

Note that, at the time of writing, the latest version of Beautiful Soup 4 is 4.14.3 and requires Python 3.8 or later.

If you don't have easy_install or pip installed, you can download the Beautiful Soup 4 source tarball and install it with setup.py.

(myenv) mvl@GNVBGL3:~/myenv$ python setup.py install

To check that Beautiful Soup is properly installed, enter the following commands in the Python interpreter −

>>> import bs4
>>> bs4.__version__
'4.14.3'

If the installation was not successful, the import raises ModuleNotFoundError.
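
The same check can be made without triggering an exception at import time. The sketch below, assuming Python 3.8 or later, asks the standard library's importlib.metadata for the installed version; the helper name installed_version is hypothetical. Note that the distribution is named beautifulsoup4 even though the importable module is bs4.

```python
import importlib.metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("beautifulsoup4")
    if version is None:
        print("Beautiful Soup is not installed in this environment")
    else:
        print("Beautiful Soup", version, "is installed")
```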

You will also need to install the requests library, an HTTP library for Python that is used to download the pages you want to parse.

pip3 install requests
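
Once both packages are installed, they are typically used together: requests downloads the page and Beautiful Soup parses it. The following is a minimal sketch, not a complete scraper. The helper names title_from_html and fetch_title are assumptions made for illustration, and the imports are placed inside the functions so that defining the helpers does not fail when a package is missing.

```python
def title_from_html(html):
    """Return the text of the <title> element, or None if there is none."""
    # Imported lazily so a missing install only fails when the helper is called.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title else None


def fetch_title(url, timeout=10):
    """Download a page with requests and return its title."""
    import requests

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # turn HTTP error statuses into exceptions
    return title_from_html(response.text)


if __name__ == "__main__":
    sample = "<html><head><title>Hello</title></head><body></body></html>"
    print(title_from_html(sample))
```

Calling fetch_title with the URL of a page you are allowed to scrape returns that page's title.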

Installing a Parser

By default, Beautiful Soup uses the HTML parser included in Python's standard library. It also supports several third-party Python parsers, such as lxml and html5lib.

To install the lxml or html5lib parser, use the commands:

pip3 install lxml
pip3 install html5lib

Each parser has its advantages and disadvantages, as shown below −

Parser: Python's html.parser

Usage − BeautifulSoup(markup, "html.parser")

Advantages

Batteries included
Decent speed
Lenient (As of Python 3.2)

Disadvantages

Not as fast as lxml, less lenient than html5lib.

Parser: lxml's HTML parser

Usage − BeautifulSoup(markup, "lxml")

Advantages

Very fast
Lenient

Disadvantages

External C dependency

Parser: lxml's XML parser

Usage − BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")

Advantages

Very fast
The only currently supported XML parser

Disadvantages

External C dependency

Parser: html5lib

Usage − BeautifulSoup(markup, "html5lib")

Advantages

Extremely lenient
Parses pages the same way a web browser does
Creates valid HTML5

Disadvantages

Very slow
External Python dependency
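
Since lxml and html5lib are optional, a script can check which parsers are installed and fall back to the built-in one. The sketch below is one possible approach; the helper name pick_parser and the preference order (lxml first for speed, then html5lib, then html.parser) are assumptions for illustration, not a recommendation from the library.

```python
import importlib.util


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the name of the first installed parser, else "html.parser"."""
    for name in preferred:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"  # always available in the standard library


if __name__ == "__main__":
    print("using parser:", pick_parser())
```

The returned string can be passed as the second argument, as in BeautifulSoup(markup, pick_parser()). Keep in mind that the parsers can build different trees from invalid markup, so fixing one parser explicitly gives more reproducible results.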
