How to Process Excel File Data in Chunks with Python?
Processing large Excel files can consume significant memory and slow down your Python applications. When dealing with Excel spreadsheets containing thousands of rows, loading the entire file into memory at once isn't always practical. This article demonstrates how to process Excel files in manageable chunks using Python and Pandas.
Prerequisites
Before working with Excel files in Python, you need to install the required libraries:
# Install required packages
# pip install pandas openpyxl xlsxwriter

import pandas as pd
import xlsxwriter
Creating Sample Excel Data
First, let's create a sample Excel file to demonstrate chunk processing:
import xlsxwriter

def write_data_to_files(inp_data, inp_file_name):
    """
    Create an Excel file with the provided data
    """
    print(f"Writing data to {inp_file_name}")
    # Create a workbook and add a worksheet
    workbook = xlsxwriter.Workbook(inp_file_name)
    worksheet = workbook.add_worksheet()
    # Start from the first cell
    row = 0
    col = 0
    # Write data row by row
    for player, titles in inp_data:
        worksheet.write(row, col, player)
        worksheet.write(row, col + 1, titles)
        row += 1
    workbook.close()
    print(f"Completed writing data to {inp_file_name}")

# Sample data: the first row serves as the header
file_name = "tennis_players.xlsx"
file_data = [
    ['Player', 'Grand Slam Titles'],
    ['Federer', 20],
    ['Nadal', 22],
    ['Djokovic', 23],
    ['Murray', 3]
]
write_data_to_files(file_data, file_name)
Writing data to tennis_players.xlsx
Completed writing data to tennis_players.xlsx
Processing Excel Files in Chunks
Unlike read_csv, pandas' read_excel doesn't provide a built-in chunksize parameter. Here's a custom function to process Excel files in chunks:
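For comparison, this is what native chunked reading looks like for CSV files. The sketch below first writes a small hypothetical tennis_players.csv so it runs standalone:

```python
import pandas as pd

# Build a small sample CSV (hypothetical file name, for illustration only)
df = pd.DataFrame({
    "Player": ["Federer", "Nadal", "Djokovic", "Murray"],
    "Grand Slam Titles": [20, 22, 23, 3],
})
df.to_csv("tennis_players.csv", index=False)

# read_csv accepts chunksize and returns an iterator of DataFrames
total_rows = 0
for chunk in pd.read_csv("tennis_players.csv", chunksize=2):
    total_rows += len(chunk)

print(f"Rows read: {total_rows}")  # Rows read: 4
```

Because read_excel offers no equivalent parameter, the rest of this article builds the chunking loop by hand.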
def process_excel_in_chunks(file_name, chunk_size):
    """
    Process an Excel file in chunks to handle large datasets efficiently
    """
    chunks = []
    chunk_number = 0
    # Read only the header row (nrows=0) to capture the column names;
    # nrows=1 would also pull the first data row into the header frame
    df_header = pd.read_excel(file_name, nrows=0)
    skiprows = 1  # skip the header row on every subsequent read
    while True:
        # Read the next chunk without a header
        df_chunk = pd.read_excel(
            file_name,
            nrows=chunk_size,
            skiprows=skiprows,
            header=None
        )
        skiprows += chunk_size
        # Stop when no more data remains
        if df_chunk.empty:
            break
        print(f"Processing chunk {chunk_number} with {len(df_chunk)} rows")
        # Process your chunk here (example: data cleaning, calculations)
        chunks.append(df_chunk)
        chunk_number += 1
    # Combine all chunks and restore the column names
    if chunks:
        final_df = pd.concat(chunks, ignore_index=True)
        final_df.columns = df_header.columns
        print(f"Processing completed. Total rows: {len(final_df)}")
        return final_df
    return df_header

# Example usage with our sample file
result_df = process_excel_in_chunks("tennis_players.xlsx", 2)
print("\nFinal DataFrame:")
print(result_df)
# Example usage with our sample file
result_df = process_excel_in_chunks("tennis_players.xlsx", 2)
print("\nFinal DataFrame:")
print(result_df)
Processing chunk 0 with 2 rows
Processing chunk 1 with 2 rows
Processing completed. Total rows: 4

Final DataFrame:
     Player  Grand Slam Titles
0   Federer                 20
1     Nadal                 22
2  Djokovic                 23
3    Murray                  3
Memory-Efficient Processing
For truly large files, you might want to process chunks individually without storing them all in memory:
def process_large_excel_file(file_name, chunk_size):
    """
    Process an Excel file chunk by chunk without loading everything into memory
    """
    chunk_number = 0
    skiprows = 1  # skip the header row
    total_processed = 0
    # Read only the header row to capture the column names
    df_header = pd.read_excel(file_name, nrows=0)
    print(f"Columns: {list(df_header.columns)}")
    while True:
        df_chunk = pd.read_excel(
            file_name,
            nrows=chunk_size,
            skiprows=skiprows,
            header=None,
            names=df_header.columns
        )
        if df_chunk.empty:
            break
        # Process the individual chunk (example: filtering, aggregation)
        processed_chunk = df_chunk.copy()
        # Example processing: filter players with more than 10 titles
        if 'Grand Slam Titles' in processed_chunk.columns:
            filtered = processed_chunk[processed_chunk['Grand Slam Titles'] > 10]
            if not filtered.empty:
                print(f"Chunk {chunk_number}: Found {len(filtered)} players with >10 titles")
        total_processed += len(df_chunk)
        skiprows += chunk_size
        chunk_number += 1
    print(f"Total rows processed: {total_processed}")

# Process the file
process_large_excel_file("tennis_players.xlsx", 2)
Columns: ['Player', 'Grand Slam Titles']
Chunk 0: Found 2 players with >10 titles
Chunk 1: Found 1 players with >10 titles
Total rows processed: 4
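Repeatedly calling read_excel re-opens and re-parses the file for every chunk. As an alternative, openpyxl's read-only mode streams rows lazily in a single pass. This is a sketch, not a pandas API; iter_excel_chunks is a hypothetical helper name:

```python
import openpyxl

def iter_excel_chunks(file_name, chunk_size):
    """Yield (header, rows) chunks from an .xlsx file in one streaming pass."""
    # read_only=True streams rows instead of loading the whole sheet
    workbook = openpyxl.load_workbook(file_name, read_only=True)
    rows = workbook.active.iter_rows(values_only=True)
    header = next(rows)  # the first row holds the column names
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield header, chunk
            chunk = []
    if chunk:  # emit the final partial chunk
        yield header, chunk
    workbook.close()

# Recreate the sample file so this snippet runs standalone
wb = openpyxl.Workbook()
for r in [("Player", "Grand Slam Titles"), ("Federer", 20),
          ("Nadal", 22), ("Djokovic", 23), ("Murray", 3)]:
    wb.active.append(r)
wb.save("tennis_players.xlsx")

for header, rows in iter_excel_chunks("tennis_players.xlsx", 2):
    print(f"Chunk of {len(rows)} rows, columns: {list(header)}")
```

Because the workbook is opened once and rows are never materialized all at once, memory use stays roughly constant regardless of file size.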
Best Practices
When processing Excel files in chunks, consider these optimization strategies:
def optimized_excel_processing(file_name, chunk_size=1000):
    """
    Optimized approach for processing large Excel files
    """
    try:
        # Read only the header row to get the column names cheaply
        df_info = pd.read_excel(file_name, nrows=0)
        # Note: counting rows this way loads the file once up front;
        # acceptable for planning, but skip it for truly huge files
        total_rows = len(pd.read_excel(file_name))
        print(f"File: {file_name}")
        print(f"Columns: {list(df_info.columns)}")
        print(f"Total rows: {total_rows}")
        print(f"Processing in chunks of {chunk_size}")
        # Calculate the number of chunks
        num_chunks = (total_rows // chunk_size) + (1 if total_rows % chunk_size else 0)
        for i in range(num_chunks):
            skiprows = i * chunk_size + 1  # always skip the header row
            df_chunk = pd.read_excel(
                file_name,
                nrows=chunk_size,
                skiprows=skiprows,
                header=None,
                names=df_info.columns
            )
            if not df_chunk.empty:
                # Your processing logic here
                yield df_chunk  # use a generator for memory efficiency
    except Exception as e:
        print(f"Error processing file: {e}")

# Use the generator
for chunk_df in optimized_excel_processing("tennis_players.xlsx", 2):
    print(f"Processing chunk with {len(chunk_df)} rows")
    # Process each chunk as needed
File: tennis_players.xlsx
Columns: ['Player', 'Grand Slam Titles']
Total rows: 4
Processing in chunks of 2
Processing chunk with 2 rows
Processing chunk with 2 rows
Conclusion
Processing Excel files in chunks is essential for handling large datasets efficiently in Python. Since pandas' built-in chunksize isn't available for Excel files, use a custom chunking approach like the ones above, and prefer generators for memory-efficient processing of very large files.
