How to use Boto3 to paginate through all objects of an S3 bucket in AWS Glue

Boto3 is the AWS SDK for Python, which lets you interact with AWS services programmatically. When working with large S3 buckets containing thousands of objects, pagination retrieves data in manageable chunks rather than loading everything at once.

Why Use Pagination?

S3 buckets can contain millions of objects. Loading all objects at once can cause memory issues and timeouts. Pagination allows you to:

  • Process objects in smaller batches

  • Reduce memory usage

  • Handle large datasets efficiently

  • Resume from a specific point using tokens

Key Parameters

The pagination function uses these important parameters:

  • max_items – Total number of records to return across all pages

  • page_size – Number of items per page (default: 1000)

  • starting_token – Token to resume pagination from a previous response
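
Under the hood, these parameters drive a continuation-token loop: each request returns one page plus a token for the next one. The mechanics can be sketched without touching AWS by faking the response shape of list_objects_v2 (fake_list_objects_v2 and list_all_keys below are illustrative helpers, not part of Boto3):

```python
# fake_list_objects_v2 is a hypothetical stand-in for the real S3 call:
# it returns up to max_keys items per call, plus a NextContinuationToken
# while more items remain -- the same response shape list_objects_v2 uses.
def fake_list_objects_v2(keys, max_keys=1000, continuation_token=None):
    start = int(continuation_token) if continuation_token else 0
    page = keys[start:start + max_keys]
    response = {'Contents': [{'Key': k} for k in page], 'KeyCount': len(page)}
    next_start = start + max_keys
    response['IsTruncated'] = next_start < len(keys)
    if response['IsTruncated']:
        response['NextContinuationToken'] = str(next_start)
    return response

def list_all_keys(keys, max_keys=1000):
    """The continuation-token loop that get_paginator() automates."""
    collected, token = [], None
    while True:
        resp = fake_list_objects_v2(keys, max_keys=max_keys,
                                    continuation_token=token)
        collected.extend(obj['Key'] for obj in resp['Contents'])
        if not resp['IsTruncated']:
            return collected
        token = resp['NextContinuationToken']

print(list_all_keys(['a.csv', 'b.csv', 'c.csv', 'd.csv', 'e.csv'], max_keys=2))
# -> ['a.csv', 'b.csv', 'c.csv', 'd.csv', 'e.csv']
```

A Boto3 paginator runs this same loop for you; PaginationConfig simply caps the total items (MaxItems), sets the per-request size (PageSize), and seeds the starting token (StartingToken).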

Basic S3 Object Pagination

Here's a complete example to paginate through S3 objects:

import boto3
from botocore.exceptions import ClientError

def paginate_s3_objects(bucket_name, max_items=None, page_size=None, starting_token=None):
    """
    Paginate through all objects in an S3 bucket
    
    Args:
        bucket_name (str): Name of the S3 bucket
        max_items (int): Maximum total items to return
        page_size (int): Items per page
        starting_token (str): Token to continue pagination
    
    Returns:
        Paginator response iterator
    """
    try:
        # Create S3 client
        session = boto3.session.Session()
        s3_client = session.client('s3')
        
        # Create paginator
        paginator = s3_client.get_paginator('list_objects_v2')
        
        # Set pagination config
        pagination_config = {}
        if max_items:
            pagination_config['MaxItems'] = max_items
        if page_size:
            pagination_config['PageSize'] = page_size
        if starting_token:
            pagination_config['StartingToken'] = starting_token
        
        # Paginate through objects
        if pagination_config:
            response = paginator.paginate(
                Bucket=bucket_name,
                PaginationConfig=pagination_config
            )
        else:
            response = paginator.paginate(Bucket=bucket_name)
            
        return response
        
    except ClientError as e:
        # Re-raise with context; avoid a blanket "except Exception",
        # which would mask unrelated bugs
        raise RuntimeError(f"AWS client error: {e}") from e

# Example usage - First batch
bucket_name = 'my-test-bucket'
page_iterator = paginate_s3_objects(bucket_name, max_items=5, page_size=2)

print("First batch of objects:")
# build_full_result() merges all pages and, when MaxItems truncates the
# listing, exposes the resume token under the 'NextToken' key.
# (Raw list_objects_v2 pages carry 'NextContinuationToken' instead, and that
# value is not a valid StartingToken when MaxItems stops mid-page.)
result = page_iterator.build_full_result()
for obj in result.get('Contents', []):
    print(f"Object: {obj['Key']}, Size: {obj['Size']} bytes")

if 'NextToken' in result:
    print(f"\nNext token: {result['NextToken']}")

Iterating Through All Pages

To process all objects in the bucket page by page:

def process_all_s3_objects(bucket_name, page_size=1000):
    """
    Process all objects in S3 bucket using pagination
    """
    try:
        session = boto3.session.Session()
        s3_client = session.client('s3')
        
        paginator = s3_client.get_paginator('list_objects_v2')
        
        total_objects = 0
        total_size = 0
        
        # Iterate through all pages
        for page in paginator.paginate(Bucket=bucket_name, PaginationConfig={'PageSize': page_size}):
            if 'Contents' not in page:
                continue
                
            page_objects = len(page['Contents'])
            total_objects += page_objects
            
            print(f"\nProcessing page with {page_objects} objects:")
            
            for obj in page['Contents']:
                print(f"  - {obj['Key']} ({obj['Size']} bytes)")
                total_size += obj['Size']
        
        print(f"\nSummary:")
        print(f"Total objects: {total_objects}")
        print(f"Total size: {total_size:,} bytes")
        
    except ClientError as e:
        print(f"AWS error: {e}")
    except Exception as e:
        print(f"Error: {e}")

# Process all objects with smaller page size
process_all_s3_objects('my-test-bucket', page_size=100)

Resume Pagination with Tokens

You can resume pagination from where you left off using continuation tokens:

def resume_pagination_example(bucket_name):
    """
    Demonstrate resuming pagination using tokens
    """
    session = boto3.session.Session()
    s3_client = session.client('s3')
    paginator = s3_client.get_paginator('list_objects_v2')
    
    # Get first batch; build_full_result() merges the pages and adds a
    # 'NextToken' entry when MaxItems cut the listing short
    print("=== First batch (max 3 items) ===")
    first_batch = paginator.paginate(
        Bucket=bucket_name,
        PaginationConfig={'MaxItems': 3, 'PageSize': 2}
    ).build_full_result()
    
    for obj in first_batch.get('Contents', []):
        print(f"Object: {obj['Key']}")
    
    next_token = first_batch.get('NextToken')
    
    if next_token:
        print("\n=== Resuming with token ===")
        print(f"Token: {next_token}")
        
        # Resume from the saved token
        resumed_batch = paginator.paginate(
            Bucket=bucket_name,
            PaginationConfig={
                'MaxItems': 3,
                'PageSize': 2,
                'StartingToken': next_token
            }
        ).build_full_result()
        
        for obj in resumed_batch.get('Contents', []):
            print(f"Resumed Object: {obj['Key']}")
# Example usage
resume_pagination_example('my-test-bucket')

Best Practices

When implementing S3 pagination, follow these guidelines:

  • Use list_objects_v2 instead of the deprecated list_objects

  • Set an appropriate page_size – the default of 1000 works well for most cases

  • Handle empty buckets – check whether the 'Contents' key exists in the response

  • Store continuation tokens – save tokens to resume processing later

  • Implement error handling – handle AWS service limits and network issues
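
The empty-bucket point is easy to get wrong: a response for an empty bucket or prefix simply has no 'Contents' key, so indexing it directly raises a KeyError. A small defensive helper shows the pattern (keys_from_page is an illustrative name, not a Boto3 API; the page dicts are hand-built samples):

```python
def keys_from_page(page):
    # An empty bucket or prefix yields a page with no 'Contents' key at all;
    # .get() with a default list avoids a KeyError
    return [obj['Key'] for obj in page.get('Contents', [])]

print(keys_from_page({'KeyCount': 0}))                               # -> []
print(keys_from_page({'Contents': [{'Key': 'a.txt', 'Size': 10}]}))  # -> ['a.txt']
```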

Comparison

Method                    Use Case            Memory Usage    Best For
list_objects_v2 (all)     Small buckets       High            < 1000 objects
Pagination                Large buckets       Low             > 1000 objects
Pagination with tokens    Resume processing   Low             Long-running jobs
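
The memory-usage column reflects streaming versus materializing: a paginator holds only the current page, much like a Python generator versus a fully built list. A rough, AWS-free illustration of that difference:

```python
import sys

# A full listing materialized at once: every key lives in memory together
all_at_once = [f"key-{i}" for i in range(100_000)]

def paged_listing(n, page_size=1000):
    # A generator yields one page at a time, the way a paginator streams
    # responses; only the current page is held in memory
    for start in range(0, n, page_size):
        yield [f"key-{i}" for i in range(start, min(start + page_size, n))]

lazy = paged_listing(100_000)
# The generator object itself stays a fixed, tiny size regardless of n
print(sys.getsizeof(all_at_once) > sys.getsizeof(lazy))  # -> True
```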

Conclusion

Use Boto3 pagination to efficiently process large S3 buckets without memory issues. Configure page_size based on your needs and use continuation tokens to resume interrupted processing. Always use list_objects_v2 for better performance and features.

Updated on: 2026-03-25T19:00:19+05:30
