Fetching text from Wikipedia's Infobox in Python
In this article, we scrape text from a Wikipedia infobox using BeautifulSoup and requests in Python. The approach takes only a few lines of code and is useful for extracting structured information from Wikipedia pages.
Prerequisites
We need to install bs4 (published on PyPI as beautifulsoup4) and requests. Run the following commands to install them:
pip install beautifulsoup4
pip install requests
Steps to Extract Infobox Data
Follow the steps below to write the code that fetches the text we want from the infobox:
- Import the bs4 and requests modules.
- Send an HTTP request to the page that you want to fetch data from using the requests.get() method.
- Parse the response text using bs4.BeautifulSoup class and store it in a variable.
- Go to the Wikipedia page and inspect the element that you want.
- Find the element using a suitable method provided by bs4, such as find() or find_all().
Example: Extracting India's Motto
Let's extract the motto from India's Wikipedia page. We'll target the infobox table and navigate to the specific row containing the motto:
# importing the modules
import requests
import bs4
# URL
URL = "https://en.wikipedia.org/wiki/India"
# sending the request
response = requests.get(URL)
# parsing the response
soup = bs4.BeautifulSoup(response.text, 'html.parser')
# Now, we have parsed HTML with us. I want to get the motto from the wikipedia page.
# Elements structure
# table - class="infobox"
# 3rd tr to get motto
# getting infobox
infobox = soup.find('table', {'class': 'infobox'})
# getting 3rd row element tr
third_tr = infobox.find_all('tr')[2]
# from third_tr we have to find first 'a' element and 'div' element to get required data
first_a = third_tr.div.find('a')
div = third_tr.div.div
# motto: link text plus the translation, dropping the 3 trailing characters after it
motto = f"{first_a.text} {div.text[:-3]}"
# printing the motto
print(motto)
If you run the above program, you will get the following result:
Satyameva Jayate "Truth Alone Triumphs"
How It Works
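The code works by targeting the infobox table structure. Wikipedia renders an infobox as an HTML table with the class "infobox", and its information is organized in table rows (tr). We locate the specific row containing the motto and extract the text from the elements nested inside it. Because the row position is hard-coded, the code depends on the page's current layout and may need adjusting if that layout changes.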
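To make the row-based layout concrete, here is a minimal sketch that walks every row of an infobox and collects label/value pairs by looking at each row's th and td cells, rather than relying on a row's position. The sample markup in SAMPLE_HTML and the helper name infobox_to_dict are assumptions made up for illustration; real Wikipedia infoboxes are larger and not every row follows this th/td pattern.

```python
import bs4

# Hypothetical, simplified infobox markup used only for illustration;
# real Wikipedia infoboxes are larger and vary from page to page.
SAMPLE_HTML = """
<table class="infobox">
  <tr><th colspan="2">Example Country</th></tr>
  <tr><th>Capital</th><td>Example City</td></tr>
  <tr><th>Population</th><td>1,234,567</td></tr>
</table>
"""


def infobox_to_dict(html):
    """Map each infobox row label (th) to its value (td)."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    infobox = soup.find("table", {"class": "infobox"})
    data = {}
    if infobox is None:
        return data
    for row in infobox.find_all("tr"):
        label = row.find("th")
        value = row.find("td")
        # Skip title or image rows that lack a label/value pair.
        if label is not None and value is not None:
            key = label.get_text(" ", strip=True)
            data[key] = value.get_text(" ", strip=True)
    return data


info = infobox_to_dict(SAMPLE_HTML)
print(info)
```

Looking fields up by label tends to survive small layout changes better than indexing a fixed row number, though it still assumes the page uses th for labels and td for values.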
Key Points
- 'html.parser' ships with Python, so it works without installing an extra parser; lxml and html5lib are alternatives if you install them
- Wikipedia's infobox structure may vary between pages
- Inspect the HTML structure before writing extraction code
- Handle potential missing elements with try-except blocks for robust scraping
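The last point can be sketched as follows. This is one possible way to guard the motto lookup, assuming the same "third row, first link inside a div" layout as the example above; the function name extract_motto and the sample markup are hypothetical. When find() matches nothing it returns None, so chained attribute access raises AttributeError, and a short table raises IndexError; both are caught here.

```python
import bs4

# Hypothetical markup: an infobox whose third row has no nested div or link.
SAMPLE_HTML = """
<table class="infobox">
  <tr><th>Example Country</th></tr>
  <tr><td>Flag</td></tr>
  <tr><td>No motto here</td></tr>
</table>
"""


def extract_motto(html):
    """Return the motto link text, or None when expected elements are missing."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    infobox = soup.find("table", {"class": "infobox"})
    if infobox is None:
        # The page has no infobox table at all.
        return None
    try:
        third_tr = infobox.find_all("tr")[2]
        # third_tr.div is None when the row has no div, which makes
        # the chained .find() call raise AttributeError.
        return third_tr.div.find("a").text
    except (IndexError, AttributeError):
        return None


print(extract_motto(SAMPLE_HTML))
```

Returning None instead of raising lets the caller decide whether a missing field is an error or simply a page without that entry.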
Conclusion
You can extract most data from Wikipedia infoboxes by inspecting the HTML structure and targeting specific elements. This method works for extracting structured information such as population, area, capitals, and other metadata from Wikipedia pages, provided you check each page's markup first.
