How to use Boto3 to paginate through all objects of an S3 bucket in AWS Glue
Boto3 is a Python SDK for AWS that allows you to interact with AWS services. When working with large S3 buckets containing thousands of objects, pagination helps retrieve data in manageable chunks rather than loading everything at once.
Why Use Pagination?
S3 buckets can contain millions of objects. Loading all objects at once can cause memory issues and timeouts. Pagination allows you to:
- Process objects in smaller batches
- Reduce memory usage
- Handle large datasets efficiently
- Resume from a specific point using continuation tokens
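The batching idea behind pagination is independent of AWS: a large listing is consumed in fixed-size chunks, with a cursor marking where the next chunk begins. A minimal pure-Python sketch of that pattern (no AWS calls; the data is made up for illustration):

```python
def paginate(items, page_size):
    """Yield successive fixed-size pages from a sequence,
    mimicking how a paginator slices a large listing."""
    for start in range(0, len(items), page_size):
        yield items[start:start + page_size]

# Seven fake object keys split into pages of three
keys = [f"logs/file-{i}.txt" for i in range(7)]
pages = list(paginate(keys, page_size=3))
print(len(pages))   # 3 pages: sizes 3, 3, and 1
print(pages[0])     # ['logs/file-0.txt', 'logs/file-1.txt', 'logs/file-2.txt']
```

Boto3's paginators apply the same idea, except the "cursor" is an opaque continuation token returned by the service rather than a list index.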
Key Parameters
The pagination function uses these important parameters:
- `max_items`: Total number of records to return across all pages
- `page_size`: Number of items per page (default: 1000)
- `starting_token`: Token to resume pagination from a previous response
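These parameters map one-to-one onto the keys of the `PaginationConfig` dictionary that boto3 paginators accept. A small sketch of how such a config might be assembled so that only the options actually supplied are included (the helper name is illustrative, not part of boto3):

```python
def build_pagination_config(max_items=None, page_size=None, starting_token=None):
    """Assemble a boto3-style PaginationConfig dict, including
    only the options that were actually supplied."""
    config = {}
    if max_items is not None:
        config['MaxItems'] = max_items
    if page_size is not None:
        config['PageSize'] = page_size
    if starting_token is not None:
        config['StartingToken'] = starting_token
    return config

print(build_pagination_config(max_items=5, page_size=2))
# {'MaxItems': 5, 'PageSize': 2}
```

Omitting unused keys matters because passing `None` values into `PaginationConfig` is not the same as leaving them out.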
Basic S3 Object Pagination
Here's a complete example that paginates through S3 objects:

```python
import boto3
from botocore.exceptions import ClientError

def paginate_s3_objects(bucket_name, max_items=None, page_size=None, starting_token=None):
    """
    Paginate through all objects in an S3 bucket.

    Args:
        bucket_name (str): Name of the S3 bucket
        max_items (int): Maximum total items to return
        page_size (int): Items per page
        starting_token (str): Token to continue pagination

    Returns:
        Paginator response iterator
    """
    try:
        # Create S3 client
        session = boto3.session.Session()
        s3_client = session.client('s3')

        # Create paginator
        paginator = s3_client.get_paginator('list_objects_v2')

        # Build pagination config from the supplied options
        pagination_config = {}
        if max_items is not None:
            pagination_config['MaxItems'] = max_items
        if page_size is not None:
            pagination_config['PageSize'] = page_size
        if starting_token is not None:
            pagination_config['StartingToken'] = starting_token

        # Paginate through objects
        if pagination_config:
            return paginator.paginate(
                Bucket=bucket_name,
                PaginationConfig=pagination_config
            )
        return paginator.paginate(Bucket=bucket_name)
    except ClientError as e:
        raise Exception(f"AWS client error: {e}")

# Example usage - first batch
bucket_name = 'my-test-bucket'
paginator = paginate_s3_objects(bucket_name, max_items=5, page_size=2)

print("First batch of objects:")
for page in paginator:
    if 'Contents' in page:
        for obj in page['Contents']:
            print(f"Object: {obj['Key']}, Size: {obj['Size']} bytes")

    # Boto3 adds NextToken to the final page when MaxItems truncates the listing
    if 'NextToken' in page:
        next_token = page['NextToken']
        print(f"\nNext token: {next_token}")
        break
```
Iterating Through All Pages
To process all objects in the bucket page by page:

```python
def process_all_s3_objects(bucket_name, page_size=1000):
    """
    Process all objects in an S3 bucket using pagination.
    """
    try:
        session = boto3.session.Session()
        s3_client = session.client('s3')
        paginator = s3_client.get_paginator('list_objects_v2')

        total_objects = 0
        total_size = 0

        # Iterate through all pages
        for page in paginator.paginate(Bucket=bucket_name,
                                       PaginationConfig={'PageSize': page_size}):
            # An empty bucket returns pages without a 'Contents' key
            if 'Contents' not in page:
                continue

            page_objects = len(page['Contents'])
            total_objects += page_objects
            print(f"\nProcessing page with {page_objects} objects:")

            for obj in page['Contents']:
                print(f"  - {obj['Key']} ({obj['Size']} bytes)")
                total_size += obj['Size']

        print("\nSummary:")
        print(f"Total objects: {total_objects}")
        print(f"Total size: {total_size:,} bytes")
    except ClientError as e:
        print(f"AWS error: {e}")

# Process all objects with a smaller page size
process_all_s3_objects('my-test-bucket', page_size=100)
```
Resume Pagination with Tokens
You can resume pagination from where you left off using continuation tokens:

```python
def resume_pagination_example(bucket_name):
    """
    Demonstrate resuming pagination using tokens.
    """
    session = boto3.session.Session()
    s3_client = session.client('s3')
    paginator = s3_client.get_paginator('list_objects_v2')

    # Get the first batch
    print("=== First batch (max 3 items) ===")
    first_batch = paginator.paginate(
        Bucket=bucket_name,
        PaginationConfig={'MaxItems': 3, 'PageSize': 2}
    )

    next_token = None
    for page in first_batch:
        if 'Contents' in page:
            for obj in page['Contents']:
                print(f"Object: {obj['Key']}")
        if 'NextToken' in page:
            next_token = page['NextToken']
            break

    if next_token:
        print("\n=== Resuming with token ===")
        print(f"Token: {next_token}")

        # Resume from the saved token
        resumed_batch = paginator.paginate(
            Bucket=bucket_name,
            PaginationConfig={
                'MaxItems': 3,
                'PageSize': 2,
                'StartingToken': next_token
            }
        )
        for page in resumed_batch:
            if 'Contents' in page:
                for obj in page['Contents']:
                    print(f"Resumed Object: {obj['Key']}")

# Example usage
resume_pagination_example('my-test-bucket')
```
Best Practices
When implementing S3 pagination, follow these guidelines:
- Use `list_objects_v2` instead of the deprecated `list_objects`
- Set an appropriate `page_size`: the default of 1000 works well for most cases
- Handle empty buckets: check whether the `Contents` key exists in the response
- Store continuation tokens: save tokens to resume processing later
- Implement error handling: handle AWS service limits and network issues
Comparison
| Method | Use Case | Memory Usage | Best For |
|---|---|---|---|
| `list_objects_v2` (all) | Small buckets | High | < 1000 objects |
| Pagination | Large buckets | Low | > 1000 objects |
| With tokens | Resume processing | Low | Long-running jobs |
Conclusion
Use Boto3 pagination to efficiently process large S3 buckets without memory issues. Configure page_size based on your needs and use continuation tokens to resume interrupted processing. Always use list_objects_v2 for better performance and features.
