Beautiful Soup - Installation
Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
The beautifulsoup4 package is not part of Python's standard library, so it must be installed separately. Before installing the latest version, let us create a virtual environment, which is the approach Python recommends.
A virtual environment gives us an isolated working copy of Python for a specific project, without affecting the system-wide setup.
We shall use the venv module from Python's standard library to create the virtual environment. pip is included by default in Python 3.4 and later.
Use the following command to create a virtual environment on Windows
C:\Users\user>python -m venv myenv
On Ubuntu Linux, update the APT package index and install venv, if required, before creating the virtual environment
mvl@GNVBGL3:~$ sudo apt update && sudo apt upgrade -y
mvl@GNVBGL3:~$ sudo apt install python3-venv
Then use the following command to create a virtual environment. Run it as your normal user, without sudo, so that the environment's files are not owned by root
mvl@GNVBGL3:~$ python3 -m venv myenv
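The same step can also be driven from Python, since the command above is a thin wrapper around the standard library's venv module. The following is a minimal sketch, not part of the original tutorial; the helper name create_env is hypothetical, and with_pip=False is chosen here only to keep the example fast and offline.

```python
import tempfile
import venv
from pathlib import Path


def create_env(env_dir):
    """Create a virtual environment at env_dir and return the path of its config file."""
    # with_pip=False skips bootstrapping pip; the command-line form
    # `python -m venv` installs pip into the environment by default.
    venv.create(env_dir, with_pip=False)
    return Path(env_dir) / "pyvenv.cfg"


# Demonstrate in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    cfg = create_env(Path(tmp) / "myenv")
    print("config file created:", cfg.exists())
```

For everyday use the command-line form is simpler; the programmatic form is mainly useful in setup scripts.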
You need to activate the virtual environment. On Windows, use the commands
C:\Users\user>cd myenv
C:\Users\user\myenv>Scripts\activate
(myenv) C:\Users\user\myenv>
On Ubuntu Linux, use the following commands to activate the virtual environment
mvl@GNVBGL3:~$ cd myenv
mvl@GNVBGL3:~/myenv$ source bin/activate
(myenv) mvl@GNVBGL3:~/myenv$
The name of the virtual environment appears in parentheses at the start of the prompt. Now that it is activated, we can install Beautiful Soup in it.
(myenv) mvl@GNVBGL3:~/myenv$ pip3 install beautifulsoup4
Collecting beautifulsoup4
  Downloading beautifulsoup4-4.12.2-py3-none-any.whl (142 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 143.0/143.0 KB 325.2 kB/s eta 0:00:00
Collecting soupsieve>1.2
  Downloading soupsieve-2.4.1-py3-none-any.whl (36 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.12.2 soupsieve-2.4.1
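pip installs into whichever interpreter is currently active, so if the package later seems to be missing, it can help to confirm that Python is really running from the virtual environment. A small sketch of one common check follows; the function name in_virtualenv is hypothetical, and the test assumes an environment created with venv.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter is running from a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter in use:", sys.executable)
```

If the first line reports False after activation, the shell is probably picking up a different python executable than the one in myenv.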
Note that, at the time of writing, the latest version of beautifulsoup4 is 4.12.2, which requires Python 3.6 or later.
If you don't have easy_install or pip installed, you can download the Beautiful Soup 4 source tarball and install it with setup.py.
(myenv) mvl@GNVBGL3:~/myenv$ python setup.py install
To check whether Beautiful Soup is properly installed, enter the following commands in the Python interpreter −
>>> import bs4
>>> bs4.__version__
'4.12.2'
If the installation was not successful, the import raises ModuleNotFoundError.
You will also need to install the requests library, an HTTP library for Python.
pip3 install requests
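Both installations can also be checked from a script without importing the packages, by asking the standard library's importlib.metadata (Python 3.8 and later) for the installed distribution versions. This is a minimal sketch; the helper name installed_version is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Use the names given to pip, not the import names (beautifulsoup4, not bs4).
for name in ("beautifulsoup4", "requests"):
    found = installed_version(name)
    print(name, "->", found if found else "not installed")
```

Returning None instead of raising keeps the check usable in setup scripts that want to report every missing package at once.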
Installing a Parser
By default, Beautiful Soup uses the HTML parser included in Python's standard library. It also supports several external third-party Python parsers, such as lxml and html5lib.
To install the lxml or html5lib parser, use the commands:
pip3 install lxml
pip3 install html5lib
These parsers have their advantages and disadvantages as shown below −
Parser: Python's html.parser
Usage − BeautifulSoup(markup, "html.parser")
Advantages
- Batteries included
- Decent speed
- Lenient (As of Python 3.2)
Disadvantages
- Not as fast as lxml, less lenient than html5lib.
Parser: lxml's HTML parser
Usage − BeautifulSoup(markup, "lxml")
Advantages
- Very fast
- Lenient
Disadvantages
- External C dependency
Parser: lxml's XML parser
Usage − BeautifulSoup(markup, "lxml-xml")
Or BeautifulSoup(markup, "xml")
Advantages
- Very fast
- The only currently supported XML parser
Disadvantages
- External C dependency
Parser: html5lib
Usage − BeautifulSoup(markup, "html5lib")
Advantages
- Extremely lenient
- Parses pages the same way a web browser does
- Creates valid HTML5
Disadvantages
- Very slow
- External Python dependency
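Because lxml and html5lib are optional installs, a script may want to find out which parsers are usable before choosing one. The sketch below uses only the standard library to test whether each backing module can be imported. The names CANDIDATES, available_parsers and pick_parser are hypothetical, and the preference order (lxml, then html5lib, then html.parser) is an assumption of this example, not a rule of Beautiful Soup.

```python
from importlib.util import find_spec

# Pairs of (name passed to BeautifulSoup, module that must be importable).
# The ordering is this sketch's preference, not a library requirement.
CANDIDATES = [
    ("lxml", "lxml"),
    ("html5lib", "html5lib"),
    ("html.parser", "html.parser"),
]


def available_parsers():
    """Return the parser names whose backing module can be imported."""
    return [name for name, module in CANDIDATES if find_spec(module) is not None]


def pick_parser():
    """Return the first available parser name; html.parser is the built-in fallback."""
    return available_parsers()[0]


print("available:", available_parsers())
print("chosen:", pick_parser())
```

The chosen name can then be passed as the second argument, as in BeautifulSoup(markup, pick_parser()). The "lxml-xml" and "xml" options also depend on lxml being installed.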