How to Extract Wikipedia Data in Python?
Wikipedia is a vast source of information that can be programmatically accessed using Python. The wikipedia library provides a simple interface to extract content, summaries, and page details from Wikipedia articles.
Installing the Wikipedia Library
First, install the wikipedia library using pip:
pip install wikipedia
Basic Wikipedia Data Extraction
Here's how to search for a topic and extract its summary:
import wikipedia
# Search for a topic
results = wikipedia.search("Python Programming")
print("Search results:", results[:3])
# Get the page
page = wikipedia.page(results[0])
# Extract basic information
print("Title:", page.title)
print("URL:", page.url)
print("Summary (first 200 chars):")
print(page.summary[:200] + "...")
Search results: ['Python (programming language)', 'Programming language', 'Computer programming']
Title: Python (programming language)
URL: https://en.wikipedia.org/wiki/Python_(programming_language)
Summary (first 200 chars):
Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation. Python is dynamically typed and garbage-colle...
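Note that slicing with summary[:200] can cut the text mid-word, as the truncated output above shows. A small stdlib-only helper can trim at the last word boundary instead (the name preview is illustrative, not part of the wikipedia library):

```python
def preview(text, limit=200):
    """Return at most `limit` characters of text, cut at a word boundary."""
    if len(text) <= limit:
        return text
    # Drop the trailing partial word before appending the ellipsis
    return text[:limit].rsplit(" ", 1)[0] + "..."

sample = "Python is a high-level, general-purpose programming language."
print(preview(sample, 20))  # Python is a...
```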
Extracting Different Types of Data
The wikipedia library allows you to extract various types of information:
import wikipedia
# Get a specific page
page = wikipedia.page("Artificial Intelligence")
# Extract different data types
print("Page Title:", page.title)
print("Categories:", page.categories[:3]) # First 3 categories
print("Links count:", len(page.links))
print("References count:", len(page.references))
# Get page content (first 300 characters)
print("Content preview:")
print(page.content[:300] + "...")
Page Title: Artificial intelligence
Categories: ['Artificial intelligence', 'Computational fields of study', 'Computer science']
Links count: 1247
References count: 312
Content preview:
Artificial intelligence (AI) is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and animals. Leading AI textbooks define the field as the study of "intelligent agents": any device that perceives its environment...
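In practice, page.categories often mixes Wikipedia's maintenance categories (names like "Articles with short description") in with the topical ones. A simple stdlib filter can strip the common ones; the prefix list below is an illustrative heuristic, not a complete rule:

```python
# Prefixes commonly seen on maintenance (non-topical) categories; illustrative only
MAINTENANCE_PREFIXES = ("Articles ", "All articles", "CS1 ", "Pages ", "Use ", "Webarchive ")

def topical_categories(categories):
    """Drop categories whose names match common maintenance-category prefixes."""
    return [c for c in categories if not c.startswith(MAINTENANCE_PREFIXES)]

cats = ["Artificial intelligence", "Articles with short description", "Computer science"]
print(topical_categories(cats))  # ['Artificial intelligence', 'Computer science']
```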
Creating a GUI Application
You can create a simple GUI to display Wikipedia data using tkinter:
import tkinter as tk
import wikipedia

def get_wikipedia_summary():
    try:
        topic = entry.get()
        if topic:
            results = wikipedia.search(topic)
            if results:
                page = wikipedia.page(results[0])
                summary = page.summary
                text_widget.delete(1.0, tk.END)
                text_widget.insert(tk.END, f"Title: {page.title}\n\n{summary}")
            else:
                text_widget.delete(1.0, tk.END)
                text_widget.insert(tk.END, "No results found!")
    except wikipedia.exceptions.DisambiguationError as e:
        text_widget.delete(1.0, tk.END)
        text_widget.insert(tk.END, f"Multiple pages found: {e.options[:5]}")
    except Exception as e:
        text_widget.delete(1.0, tk.END)
        text_widget.insert(tk.END, f"Error: {str(e)}")

# Create GUI
win = tk.Tk()
win.geometry("800x600")
win.title("Wikipedia Data Extractor")

# Input field
tk.Label(win, text="Enter topic:", font=("Arial", 12)).pack(pady=5)
entry = tk.Entry(win, width=50, font=("Arial", 10))
entry.pack(pady=5)

# Button
tk.Button(win, text="Get Summary", command=get_wikipedia_summary,
          font=("Arial", 10)).pack(pady=5)

# Text display
text_widget = tk.Text(win, height=30, width=90, wrap=tk.WORD)
text_widget.pack(pady=10, padx=10, fill=tk.BOTH, expand=True)

win.mainloop()
Handling Common Issues
When working with Wikipedia data, you may encounter disambiguation pages or connection errors:
import wikipedia
def safe_wikipedia_search(query):
    try:
        # Search for the topic
        results = wikipedia.search(query, results=5)
        if not results:
            return "No results found"
        # Try to get the first result
        page = wikipedia.page(results[0])
        return f"Title: {page.title}\nSummary: {page.summary[:200]}..."
    except wikipedia.exceptions.DisambiguationError as e:
        # Handle disambiguation
        return f"Multiple options found: {e.options[:3]}"
    except wikipedia.exceptions.PageError:
        return "Page not found"
    except Exception as e:
        return f"Error occurred: {str(e)}"

# Test with different queries
queries = ["Python", "Java", "NonExistentTopic123"]
for query in queries:
    print(f"Query: {query}")
    result = safe_wikipedia_search(query)
    print(result)
    print("-" * 50)
Query: Python
Multiple options found: ['Python (programming language)', 'Python (mythology)', 'Pythonidae']
--------------------------------------------------
Query: Java
Multiple options found: ['Java', 'Java (programming language)', 'Java (island)']
--------------------------------------------------
Query: NonExistentTopic123
No results found
--------------------------------------------------
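Instead of surfacing the disambiguation options to the caller, you may want to fall back to the first suggestion automatically. Below is a network-free sketch of that retry logic: the page fetcher is injected as a function so the pattern can be exercised offline (with the real library you would pass wikipedia.page). The names resolve_first, FakeDisambiguation, and fake_fetch are illustrative, not part of the wikipedia API:

```python
def resolve_first(fetch, query):
    """Call fetch(query); if it raises an error carrying an `options`
    list (as wikipedia.exceptions.DisambiguationError does), retry
    once with the first suggested option."""
    try:
        return fetch(query)
    except Exception as exc:
        options = getattr(exc, "options", None)
        if options:
            return fetch(options[0])
        raise

# Offline demonstration with a fake fetcher
class FakeDisambiguation(Exception):
    def __init__(self, options):
        super().__init__("ambiguous")
        self.options = options

def fake_fetch(title):
    if title == "Python":
        raise FakeDisambiguation(["Python (programming language)", "Pythonidae"])
    return f"page:{title}"

print(resolve_first(fake_fetch, "Python"))  # page:Python (programming language)
print(resolve_first(fake_fetch, "Java"))    # page:Java
```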
Key Features
| Method | Purpose | Returns |
|---|---|---|
| wikipedia.search() | Search for topics | List of page titles |
| wikipedia.page() | Get page object | WikipediaPage object |
| page.summary | Get article summary | String |
| page.content | Get full article text | String |
Conclusion
The wikipedia library makes it easy to extract data from Wikipedia programmatically. For robust applications, always handle exceptions such as disambiguation errors and missing pages. You can also integrate the library with GUI frameworks like tkinter to create interactive Wikipedia data extractors.
