How to use Boto3 to paginate through all objects of an S3 bucket in AWS Glue

Boto3 is the AWS SDK for Python, which lets you interact with AWS services programmatically. When working with large S3 buckets containing thousands of objects, pagination retrieves data in manageable chunks rather than loading everything at once.

Why Use Pagination?

S3 buckets can contain millions of objects. Loading all objects at once can cause memory issues and timeouts. Pagination allows you to:

  • Process objects in smaller batches

  • Reduce memory usage

  • Handle large datasets efficiently

  • Resume from a specific point using tokens

Key Parameters

The pagination function uses these important parameters:

  • max_items – Total number of records to return across all pages

  • page_size – Number of items per page (default: 1000)

  • starting_token – Token to resume pagination from a previous response
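
Under the hood, these parameters drive a continuation-token loop: each request returns one page plus a token for the next one. The mechanics can be sketched without touching AWS by faking the response shape of list_objects_v2 (fake_list_objects_v2 and list_all_keys below are illustrative helpers, not part of Boto3):

```python
# fake_list_objects_v2 is a hypothetical stand-in for the real S3 call:
# it returns up to max_keys items per call, plus a NextContinuationToken
# while more items remain -- the same response shape list_objects_v2 uses.
def fake_list_objects_v2(keys, max_keys=1000, continuation_token=None):
    start = int(continuation_token) if continuation_token else 0
    page = keys[start:start + max_keys]
    response = {'Contents': [{'Key': k} for k in page], 'KeyCount': len(page)}
    next_start = start + max_keys
    response['IsTruncated'] = next_start < len(keys)
    if response['IsTruncated']:
        response['NextContinuationToken'] = str(next_start)
    return response

def list_all_keys(keys, max_keys=1000):
    """The continuation-token loop that get_paginator() automates."""
    collected, token = [], None
    while True:
        resp = fake_list_objects_v2(keys, max_keys=max_keys,
                                    continuation_token=token)
        collected.extend(obj['Key'] for obj in resp['Contents'])
        if not resp['IsTruncated']:
            return collected
        token = resp['NextContinuationToken']

print(list_all_keys(['a.csv', 'b.csv', 'c.csv', 'd.csv', 'e.csv'], max_keys=2))
# -> ['a.csv', 'b.csv', 'c.csv', 'd.csv', 'e.csv']
```

A Boto3 paginator runs this same loop for you; PaginationConfig simply caps the total items (MaxItems), sets the per-request size (PageSize), and seeds the starting token (StartingToken).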

Basic S3 Object Pagination

Here's a complete example to paginate through S3 objects:

import boto3
from botocore.exceptions import ClientError

def paginate_s3_objects(bucket_name, max_items=None, page_size=None, starting_token=None):
    """
    Paginate through all objects in an S3 bucket
    
    Args:
        bucket_name (str): Name of the S3 bucket
        max_items (int): Maximum total items to return
        page_size (int): Items per page
        starting_token (str): Token to continue pagination
    
    Returns:
        Paginator response iterator
    """
    try:
        # Create S3 client
        session = boto3.session.Session()
        s3_client = session.client('s3')
        
        # Create paginator
        paginator = s3_client.get_paginator('list_objects_v2')
        
        # Set pagination config
        pagination_config = {}
        if max_items:
            pagination_config['MaxItems'] = max_items
        if page_size:
            pagination_config['PageSize'] = page_size
        if starting_token:
            pagination_config['StartingToken'] = starting_token
        
        # Paginate through objects
        if pagination_config:
            response = paginator.paginate(
                Bucket=bucket_name,
                PaginationConfig=pagination_config
            )
        else:
            response = paginator.paginate(Bucket=bucket_name)
            
        return response
        
    except ClientError as e:
        # Re-raise with context; avoid a blanket "except Exception",
        # which would mask unrelated bugs
        raise RuntimeError(f"AWS client error: {e}") from e

# Example usage - First batch
bucket_name = 'my-test-bucket'
page_iterator = paginate_s3_objects(bucket_name, max_items=5, page_size=2)

print("First batch of objects:")
# build_full_result() merges all pages and, when MaxItems truncates the
# listing, exposes the resume token under the 'NextToken' key.
# (Raw list_objects_v2 pages carry 'NextContinuationToken' instead, and that
# value is not a valid StartingToken when MaxItems stops mid-page.)
result = page_iterator.build_full_result()
for obj in result.get('Contents', []):
    print(f"Object: {obj['Key']}, Size: {obj['Size']} bytes")

if 'NextToken' in result:
    print(f"\nNext token: {result['NextToken']}")

Iterating Through All Pages

To process all objects in the bucket page by page:

def process_all_s3_objects(bucket_name, page_size=1000):
    """
    Process all objects in S3 bucket using pagination
    """
    try:
        session = boto3.session.Session()
        s3_client = session.client('s3')
        
        paginator = s3_client.get_paginator('list_objects_v2')
        
        total_objects = 0
        total_size = 0
        
        # Iterate through all pages
        for page in paginator.paginate(Bucket=bucket_name, PaginationConfig={'PageSize': page_size}):
            if 'Contents' not in page:
                continue
                
            page_objects = len(page['Contents'])
            total_objects += page_objects
            
            print(f"\nProcessing page with {page_objects} objects:")
            
            for obj in page['Contents']:
                print(f"  - {obj['Key']} ({obj['Size']} bytes)")
                total_size += obj['Size']
        
        print(f"\nSummary:")
        print(f"Total objects: {total_objects}")
        print(f"Total size: {total_size:,} bytes")
        
    except ClientError as e:
        print(f"AWS error: {e}")
    except Exception as e:
        print(f"Error: {e}")

# Process all objects with smaller page size
process_all_s3_objects('my-test-bucket', page_size=100)

Resume Pagination with Tokens

You can resume pagination from where you left off using continuation tokens:

def resume_pagination_example(bucket_name):
    """
    Demonstrate resuming pagination using tokens
    """
    session = boto3.session.Session()
    s3_client = session.client('s3')
    paginator = s3_client.get_paginator('list_objects_v2')
    
    # Get first batch; build_full_result() merges the pages and adds a
    # 'NextToken' entry when MaxItems cut the listing short
    print("=== First batch (max 3 items) ===")
    first_batch = paginator.paginate(
        Bucket=bucket_name,
        PaginationConfig={'MaxItems': 3, 'PageSize': 2}
    ).build_full_result()
    
    for obj in first_batch.get('Contents', []):
        print(f"Object: {obj['Key']}")
    
    next_token = first_batch.get('NextToken')
    
    if next_token:
        print("\n=== Resuming with token ===")
        print(f"Token: {next_token}")
        
        # Resume from the saved token
        resumed_batch = paginator.paginate(
            Bucket=bucket_name,
            PaginationConfig={
                'MaxItems': 3,
                'PageSize': 2,
                'StartingToken': next_token
            }
        ).build_full_result()
        
        for obj in resumed_batch.get('Contents', []):
            print(f"Resumed Object: {obj['Key']}")
# Example usage
resume_pagination_example('my-test-bucket')

Best Practices

When implementing S3 pagination, follow these guidelines:

  • Use list_objects_v2 instead of the deprecated list_objects

  • Set an appropriate page_size – the default of 1000 works well for most cases

  • Handle empty buckets – check whether the 'Contents' key exists in the response

  • Store continuation tokens – save tokens to resume processing later

  • Implement error handling – handle AWS service limits and network issues
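
The empty-bucket point is easy to get wrong: a response for an empty bucket or prefix simply has no 'Contents' key, so indexing it directly raises a KeyError. A small defensive helper shows the pattern (keys_from_page is an illustrative name, not a Boto3 API; the page dicts are hand-built samples):

```python
def keys_from_page(page):
    # An empty bucket or prefix yields a page with no 'Contents' key at all;
    # .get() with a default list avoids a KeyError
    return [obj['Key'] for obj in page.get('Contents', [])]

print(keys_from_page({'KeyCount': 0}))                               # -> []
print(keys_from_page({'Contents': [{'Key': 'a.txt', 'Size': 10}]}))  # -> ['a.txt']
```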

Comparison

Method                    Use Case            Memory Usage    Best For
list_objects_v2 (all)     Small buckets       High            < 1000 objects
Pagination                Large buckets       Low             > 1000 objects
Pagination with tokens    Resume processing   Low             Long-running jobs
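
The memory-usage column reflects streaming versus materializing: a paginator holds only the current page, much like a Python generator versus a fully built list. A rough, AWS-free illustration of that difference:

```python
import sys

# A full listing materialized at once: every key lives in memory together
all_at_once = [f"key-{i}" for i in range(100_000)]

def paged_listing(n, page_size=1000):
    # A generator yields one page at a time, the way a paginator streams
    # responses; only the current page is held in memory
    for start in range(0, n, page_size):
        yield [f"key-{i}" for i in range(start, min(start + page_size, n))]

lazy = paged_listing(100_000)
# The generator object itself stays a fixed, tiny size regardless of n
print(sys.getsizeof(all_at_once) > sys.getsizeof(lazy))  # -> True
```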

Conclusion

Use Boto3 pagination to efficiently process large S3 buckets without memory issues. Configure page_size based on your needs and use continuation tokens to resume interrupted processing. Always use list_objects_v2 for better performance and features.

Updated on: 2026-03-25T19:00:19+05:30
