How to fetch the top 10 starred repositories on GitHub using Python?
GitHub is the world's largest platform for version control and collaborative development. You can scrape GitHub's trending repositories page to fetch the top 10 most starred repositories within a specific timeframe using Python's requests and BeautifulSoup libraries.
This tutorial demonstrates how to scrape GitHub's trending page, extract repository information, and save the results to a file with proper formatting.
Required Libraries
First, ensure you have the necessary libraries installed:
pip install requests beautifulsoup4 lxml
Complete Implementation
Here's the complete code to fetch and display the top 10 trending repositories:
import requests
from bs4 import BeautifulSoup

# Fetch the trending repositories page
r = requests.get('https://github.com/trending/python?since=monthly')
bs = BeautifulSoup(r.text, 'lxml')

# Find all repository containers
repo_containers = bs.find_all('article', class_='Box-row')

# Open file to store results
with open('starred-repos.txt', 'w') as f1:
    # Write header
    f1.write('{}\t{}\t\t{}\n\n'.format('Position', 'Owner', 'Repository'))
    # Process top 10 repositories
    for i, container in enumerate(repo_containers[:10]):
        # Extract repository link
        repo_link = container.find('h2').find('a')
        if repo_link:
            href = repo_link.get('href')
            # Split the href to get owner and repo name
            parts = href.strip('/').split('/')
            if len(parts) >= 2:
                owner = parts[0]
                repo_name = parts[1]
                # Write to file
                f1.write('{}.\t{}\t\t{}\n'.format(i + 1, owner, repo_name))

# Read and display the results
print("Top 10 Trending Python Repositories:")
print("-" * 50)
with open('starred-repos.txt', 'r') as f1:
    print(f1.read())
How the Code Works
The script follows these key steps:
- Web Scraping: Uses requests.get() to fetch the GitHub trending page
- HTML Parsing: BeautifulSoup parses the HTML content with the lxml parser
- Data Extraction: Finds repository containers and extracts owner/repository names from href attributes
- File Operations: Saves formatted results to a text file and displays them
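The data-extraction step above hinges on how a repository link's href is split into owner and repository name. Here is a minimal, self-contained sketch of that parsing step, using a hypothetical href value of the form the trending page produces:

```python
# Hypothetical href value as it appears in a trending-page anchor tag
href = '/microsoft/vscode'

# strip('/') removes the leading slash, split('/') separates owner and repo
parts = href.strip('/').split('/')
owner, repo_name = parts[0], parts[1]

print(owner, repo_name)  # microsoft vscode
```

Checking len(parts) >= 2 before unpacking, as the full script does, guards against unexpected hrefs such as anchors or absolute URLs.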
Alternative Approach Using GitHub API
For more reliable data access, consider using the GitHub API instead of web scraping:
import requests
import json
from datetime import datetime, timedelta
# Calculate date for last month
last_month = (datetime.now() - timedelta(days=30)).strftime('%Y-%m-%d')
# GitHub API endpoint for searching repositories
url = f'https://api.github.com/search/repositories?q=created:>{last_month}&sort=stars&order=desc&per_page=10'
response = requests.get(url)
data = response.json()
print("Top 10 Most Starred Repositories (Last Month):")
print("-" * 50)
for i, repo in enumerate(data['items'], 1):
    print(f"{i}. {repo['owner']['login']}/{repo['name']} - {repo['stargazers_count']} stars")
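As a variation on the f-string URL above, the search qualifiers can be passed through requests' params argument, which handles URL encoding (for example, the > in the created qualifier) automatically. This sketch uses a fixed example date in place of the computed one:

```python
import requests

# Same query as above, expressed as a params dict instead of an f-string URL
params = {
    'q': 'created:>2024-01-01',  # example date; compute it as shown above
    'sort': 'stars',
    'order': 'desc',
    'per_page': 10,
}

# Prepare the request to inspect the final, percent-encoded URL
req = requests.Request('GET', 'https://api.github.com/search/repositories', params=params)
prepared = req.prepare()
print(prepared.url)
```

In practice you would simply call requests.get(url, params=params); the prepared request is shown here only to make the resulting URL visible.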
Key Points
- Web Scraping: GitHub's HTML structure may change, making scraped code fragile
- API Approach: More reliable and provides structured JSON data
- Rate Limits: GitHub API has rate limits; consider authentication for higher limits
- Error Handling: Add try-except blocks for production use
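The last two points can be combined into a small wrapper around the API call. This is a sketch, not part of the original script: build_headers and fetch_top_repos are hypothetical helper names, and passing a personal access token via the Authorization header is optional but raises the rate limit:

```python
import requests

def build_headers(token=None):
    # GitHub recommends an explicit Accept header; a token raises rate limits
    headers = {'Accept': 'application/vnd.github+json'}
    if token:
        headers['Authorization'] = f'Bearer {token}'
    return headers

def fetch_top_repos(url, token=None):
    try:
        response = requests.get(url, headers=build_headers(token), timeout=10)
        response.raise_for_status()  # raises on 4xx/5xx, e.g. a rate-limit 403
        return response.json().get('items', [])
    except requests.exceptions.RequestException as exc:
        print(f'Request failed: {exc}')
        return []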
Expected Output Format
Position	Owner		Repository

1.	microsoft	vscode
2.	tensorflow	tensorflow
3.	facebook	react
4.	vuejs		vue
5.	angular		angular
6.	nodejs		node
7.	kubernetes	kubernetes
8.	moby		moby
9.	golang		go
10.	atom		atom
Conclusion
This tutorial shows how to scrape GitHub's trending page using BeautifulSoup and save the results to a file. For production applications, consider using the GitHub API for more reliable and structured data access.
