- Beautiful Soup Tutorial
- Beautiful Soup - Home
- Beautiful Soup - Overview
- Beautiful Soup - Web Scraping
- Beautiful Soup - Installation
- Beautiful Soup - Souping the Page
- Beautiful Soup - Kinds of objects
- Beautiful Soup - Inspect Data Source
- Beautiful Soup - Scrape HTML Content
- Beautiful Soup - Navigating by Tags
- Beautiful Soup - Find Elements by ID
- Beautiful Soup - Find Elements by Class
- Beautiful Soup - Find Elements by Attribute
- Beautiful Soup - Searching the Tree
- Beautiful Soup - Modifying the Tree
- Beautiful Soup - Parsing a Section of a Document
- Beautiful Soup - Find all Children of an Element
- Beautiful Soup - Find Element using CSS Selectors
- Beautiful Soup - Find all Comments
- Beautiful Soup - Scraping List from HTML
- Beautiful Soup - Scraping Paragraphs from HTML
- BeautifulSoup - Scraping Link from HTML
- Beautiful Soup - Get all HTML Tags
- Beautiful Soup - Get Text Inside Tag
- Beautiful Soup - Find all Headings
- Beautiful Soup - Extract Title Tag
- Beautiful Soup - Extract Email IDs
- Beautiful Soup - Scrape Nested Tags
- Beautiful Soup - Parsing Tables
- Beautiful Soup - Selecting nth Child
- Beautiful Soup - Search by text inside a Tag
- Beautiful Soup - Remove HTML Tags
- Beautiful Soup - Remove all Styles
- Beautiful Soup - Remove all Scripts
- Beautiful Soup - Remove Empty Tags
- Beautiful Soup - Remove Child Elements
- Beautiful Soup - find vs find_all
- Beautiful Soup - Specifying the Parser
- Beautiful Soup - Comparing Objects
- Beautiful Soup - Copying Objects
- Beautiful Soup - Get Tag Position
- Beautiful Soup - Encoding
- Beautiful Soup - Output Formatting
- Beautiful Soup - Pretty Printing
- Beautiful Soup - NavigableString Class
- Beautiful Soup - Convert Object to String
- Beautiful Soup - Convert HTML to Text
- Beautiful Soup - Parsing XML
- Beautiful Soup - Error Handling
- Beautiful Soup - Trouble Shooting
- Beautiful Soup - Porting Old Code
- Beautiful Soup - Functions Reference
- Beautiful Soup - contents Property
- Beautiful Soup - children Property
- Beautiful Soup - string Property
- Beautiful Soup - strings Property
- Beautiful Soup - stripped_strings Property
- Beautiful Soup - descendants Property
- Beautiful Soup - parent Property
- Beautiful Soup - parents Property
- Beautiful Soup - next_sibling Property
- Beautiful Soup - previous_sibling Property
- Beautiful Soup - next_siblings Property
- Beautiful Soup - previous_siblings Property
- Beautiful Soup - next_element Property
- Beautiful Soup - previous_element Property
- Beautiful Soup - next_elements Property
- Beautiful Soup - previous_elements Property
- Beautiful Soup - find Method
- Beautiful Soup - find_all Method
- Beautiful Soup - find_parents Method
- Beautiful Soup - find_parent Method
- Beautiful Soup - find_next_siblings Method
- Beautiful Soup - find_next_sibling Method
- Beautiful Soup - find_previous_siblings Method
- Beautiful Soup - find_previous_sibling Method
- Beautiful Soup - find_all_next Method
- Beautiful Soup - find_next Method
- Beautiful Soup - find_all_previous Method
- Beautiful Soup - find_previous Method
- Beautiful Soup - select Method
- Beautiful Soup - append Method
- Beautiful Soup - extend Method
- Beautiful Soup - NavigableString Method
- Beautiful Soup - new_tag Method
- Beautiful Soup - insert Method
- Beautiful Soup - insert_before Method
- Beautiful Soup - insert_after Method
- Beautiful Soup - clear Method
- Beautiful Soup - extract Method
- Beautiful Soup - decompose Method
- Beautiful Soup - replace_with Method
- Beautiful Soup - wrap Method
- Beautiful Soup - unwrap Method
- Beautiful Soup - smooth Method
- Beautiful Soup - prettify Method
- Beautiful Soup - encode Method
- Beautiful Soup - decode Method
- Beautiful Soup - get_text Method
- Beautiful Soup - diagnose Method
- Beautiful Soup Useful Resources
- Beautiful Soup - Quick Guide
- Beautiful Soup - Useful Resources
- Beautiful Soup - Discussion
Beautiful Soup - Navigating by Tags
One of the important pieces of element in any piece of HTML document are tags, which may contain other tags/strings (tag's children). Beautiful Soup provides different ways to navigate and iterate over's tag's children.
Easiest way to search a parse tree is to search the tag by its name.
soup.head
The soup.head function returns the contents put inside the <head> .. </head> element of a HTML page.
Consider the following HTML page to be scraped: <html> <head> <title>TutorialsPoint</title> <script> document.write("Welcome to TutorialsPoint"); </script> </head> <body> <h1>Tutorialspoint Online Library</h1> <p><b>It's all Free</b></p> </body> </html>
Following code extracts the contents of <head> element
Example
from bs4 import BeautifulSoup with open("index.html") as fp: soup = BeautifulSoup(fp, 'html.parser') print(soup.head)
Output
<head> <title>TutorialsPoint</title> <script> document.write("Welcome to TutorialsPoint"); </script> </head>
soup.body
Similarly, to return the contents of body part of HTML page, use soup.body
Example
from bs4 import BeautifulSoup with open("index.html") as fp: soup = BeautifulSoup(fp, 'html.parser') print (soup.body)
Output
<body> <h1>Tutorialspoint Online Library</h1> <p><b>It's all Free</b></p> </body>
You can also extract specific tag (like first <h1> tag) in the <body> tag.
Example
from bs4 import BeautifulSoup with open("index.html") as fp: soup = BeautifulSoup(fp, 'html.parser') print(soup.body.h1)
Output
<h1>Tutorialspoint Online Library</h1>
soup.p
Our HTML file contains a <p> tag. We can extract the contents of this tag
Example
from bs4 import BeautifulSoup with open("index.html") as fp: soup = BeautifulSoup(fp, 'html.parser') print(soup.p)
Output
<p><b>It's all Free</b></p>
Tag.contents
A Tag object may have one or more PageElements. The Tag object's contents property returns a list of all elements included in it.
Let us find the elements in <head> tag of our index.html file.
Example
from bs4 import BeautifulSoup with open("index.html") as fp: soup = BeautifulSoup(fp, 'html.parser') tag = soup.head print (tag.contents)
Output
['\n', <title>TutorialsPoint</title>, '\n', <script> document.write("Welcome to TutorialsPoint"); </script>, '\n']
Tag.children
The structure of tags in a HTML script is hierarchical. The elements are nested one inside the other. For example, the top level <HTML> tag includes <HEAD> and <BODY> tags, each may have other tags in it.
The Tag object has a children property that returns a list iterator object containing the enclosed PageElements.
To demonstrate the children property, we shall use the following HTML script (index.html). In the <body> section, there are two <ul> list elements, one nested in another. In other words, the body tag has top level list elements, and each list element has another list under it.
<html> <head> <title>TutorialsPoint</title> </head> <body> <h2>Departmentwise Employees</h2> <ul> <li>Accounts</li> <ul> <li>Anand</li> <li>Mahesh</li> </ul> <li>HR</li> <ul> <li>Rani</li> <li>Ankita</li> </ul> </ul> </body> </html>
The following Python code gives a list of all the children elements of top level <ul> tag.
Example
from bs4 import BeautifulSoup with open("index.html") as fp: soup = BeautifulSoup(fp, 'html.parser') tag = soup.ul print (list(tag.children))
Output
['\n', <li>Accounts</li>, '\n', <ul> <li>Anand</li> <li>Mahesh</li> </ul>, '\n', <li>HR</li>, '\n', <ul> <li>Rani</li> <li>Ankita</li> </ul>, '\n']
Since the .children property returns a list_iterator, we can use a for loop to traverse the hierarchy.
Example
for child in tag.children: print (child)
Output
<li>Accounts</li> <ul> <li>Anand</li> <li>Mahesh</li> </ul> <li>HR</li> <ul> <li>Rani</li> <li>Ankita</li> </ul>
Tag.find_all()
This method returns a result set of contents of all the tags matching with the argument tag provided.
Let us consider the following HTML page(index.html) for this −
<html> <body> <h1>Tutorialspoint Online Library</h1> <p><b>It's all Free</b></p> <a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a> <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a> <a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a> <a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a> <a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a> </body> </html>
The following code lists all the elements with <a> tag
Example
from bs4 import BeautifulSoup with open("index.html") as fp: soup = BeautifulSoup(fp, 'html.parser') result = soup.find_all("a") print (result)
Output
[ <a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>, <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>, <a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>, <a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a>, <a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a> ]
To Continue Learning Please Login
Login with Google