How to convert PDF files to Excel files using Python?
Python provides several libraries for handling file conversions. In this article, we'll learn how to convert PDF files to Excel format using the tabula-py module. The tabula-py library is a Python wrapper around tabula-java; it extracts tabular data from PDF documents and returns it as pandas DataFrames.
Prerequisites
Before working with tabula-py, ensure you have Java installed on your system, as the library depends on Java for PDF processing.
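One way to confirm this prerequisite before calling tabula-py is to look for a java executable on the PATH. The following is a minimal sketch using only the standard library; java_available is a hypothetical helper name, not part of tabula-py.

```python
import shutil
import subprocess


def java_available() -> bool:
    """Return True if a `java` executable is on PATH and runs successfully."""
    java_path = shutil.which("java")
    if java_path is None:
        return False
    try:
        # `java -version` exits with status 0 when a working runtime is present
        result = subprocess.run(
            [java_path, "-version"],
            capture_output=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if java_available():
        print("Java found")
    else:
        print("Java not found; install a Java runtime first")
```

Running the check up front gives a clearer message than the error raised later, when tabula-py first tries to start Java.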
Installation
Install the required package using pip −
pip install tabula-py
Basic Syntax
The main functions we'll use are −
- tabula.read_pdf() − Reads tables from a PDF and returns a list of DataFrames
- tabula.convert_into() − Converts a PDF directly to a CSV, TSV, or JSON file
Method 1: Using read_pdf() and to_csv()
The first approach reads the PDF into a DataFrame, then exports it to CSV −
import tabula
import pandas as pd
# Read PDF file (replace with your PDF path)
df = tabula.read_pdf("sample_table.pdf", pages='all')[0]
# Convert DataFrame to CSV
df.to_csv("output_file.csv", index=False)
# Display the DataFrame
print(df.head())
print(f"Data successfully converted to CSV with {len(df)} rows")
Method 2: Using convert_into() Function
The second approach writes the output file directly, without loading the tables into a DataFrame first −
import tabula
# Direct PDF to CSV conversion
tabula.convert_into(
    "sample_table.pdf",      # Input PDF file
    "converted_data.csv",    # Output CSV file
    output_format="csv",     # Output format
    pages='all'              # Convert all pages
)
print("PDF successfully converted to CSV format")
# Read the converted file to verify
import pandas as pd
df = pd.read_csv("converted_data.csv")
print(f"Converted file contains {len(df)} rows and {len(df.columns)} columns")
Converting to Excel Format
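tabula-py does not write .xlsx files itself. To create an actual Excel file, read the tables with tabula and write them with pandas (the openpyxl package must be installed) −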
import tabula
import pandas as pd
# Read PDF data
tables = tabula.read_pdf("sample_table.pdf", pages='all')
# Create Excel writer object
with pd.ExcelWriter("output_file.xlsx", engine='openpyxl') as writer:
    for i, table in enumerate(tables):
        # Write each table to a separate sheet
        sheet_name = f"Table_{i+1}"
        table.to_excel(writer, sheet_name=sheet_name, index=False)
print(f"Created Excel file with {len(tables)} sheets")
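Sheet names of the form Table_1, Table_2 are always valid. If sheets are instead named after something taken from the document, such as a table title, the name has to satisfy Excel's rules: at most 31 characters, and none of the characters \ / ? * [ ] :. The sketch below shows one way to sanitize such a name; safe_sheet_name and its fallback value are hypothetical, not part of pandas or tabula-py.

```python
import re

# Characters Excel does not allow in worksheet names
_INVALID_SHEET_CHARS = re.compile(r"[\[\]:*?/\\]")
MAX_SHEET_NAME_LENGTH = 31


def safe_sheet_name(name: str, fallback: str = "Sheet") -> str:
    """Replace forbidden characters and trim the name to Excel's length limit."""
    cleaned = _INVALID_SHEET_CHARS.sub("_", name).strip()
    if not cleaned:
        # An empty sheet name is not allowed either
        cleaned = fallback
    return cleaned[:MAX_SHEET_NAME_LENGTH]


print(safe_sheet_name("Q1/Q2: Sales [draft]"))  # Q1_Q2_ Sales _draft_
```

The result can be passed as sheet_name to to_excel(). Note that this sketch does not guard against two tables ending up with the same name after trimming.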
Handling Multiple Pages
For PDFs with tables across multiple pages −
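import tabula
import pandas as pd
# Read a single page and take the first table found on it
df_page1 = tabula.read_pdf("document.pdf", pages=1)[0]
# Read a list of specific pages (returns a list of DataFrames)
df_multiple = tabula.read_pdf("document.pdf", pages=[1, 2, 3])
# Read all pages and combine (assumes every table has the same columns)
all_tables = tabula.read_pdf("document.pdf", pages='all')
combined_df = pd.concat(all_tables, ignore_index=True)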
print(f"Combined data has {len(combined_df)} total rows")
Parameters and Options
| Parameter | Description | Example |
|---|---|---|
| pages | Specify which pages to extract | 'all', 1, [1,2,3] |
| multiple_tables | Extract multiple tables per page | True/False |
| area | Define extraction area coordinates | [y1, x1, y2, x2] |
| output_format | Output file format | "csv", "json" |
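The area values are, by default, measured in PDF points (72 points per inch) from the top-left corner of the page, in the order top, left, bottom, right. If a region was measured in inches, it can be converted as in the sketch below; inches_to_area is a hypothetical helper, and the file name in the commented usage line is a placeholder.

```python
POINTS_PER_INCH = 72.0


def inches_to_area(top: float, left: float, bottom: float, right: float) -> list:
    """Convert a region measured in inches to [top, left, bottom, right] in points."""
    if bottom <= top or right <= left:
        raise ValueError("bottom must exceed top and right must exceed left")
    return [value * POINTS_PER_INCH for value in (top, left, bottom, right)]


area = inches_to_area(1.0, 0.5, 10.0, 8.0)
print(area)  # [72.0, 36.0, 720.0, 576.0]

# Hypothetical usage with tabula-py:
# tables = tabula.read_pdf("sample_table.pdf", pages=1, area=area)
```

Restricting the area is mainly useful when a page mixes a table with surrounding text that would otherwise be picked up as extra rows or columns.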
Error Handling
Always include error handling for robust PDF processing −
import tabula
import pandas as pd
try:
    # Attempt to read PDF
    tables = tabula.read_pdf("input.pdf", pages='all')
    if tables:
        # Save the first table to Excel
        tables[0].to_excel("output.xlsx", index=False)
        print("Conversion successful!")
    else:
        print("No tables found in the PDF")
except Exception as e:
    print(f"Error during conversion: {e}")
    print("Please check if the PDF contains tabular data and Java is installed")
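A broad except clause reports every failure the same way. Some problems can be caught earlier with a clearer message, for example a missing input file or a file that is not a PDF at all. The following is a rough sketch of such a pre-check using only the standard library; check_pdf_path is a hypothetical helper, and the header test is a heuristic rather than full validation.

```python
from pathlib import Path


def check_pdf_path(path: str) -> Path:
    """Raise a descriptive error if `path` is missing or does not look like a PDF."""
    pdf_path = Path(path)
    if not pdf_path.is_file():
        raise FileNotFoundError(f"No such file: {pdf_path}")
    with pdf_path.open("rb") as handle:
        # PDF files carry a "%PDF-" marker at (or very near) the start
        head = handle.read(1024)
    if b"%PDF-" not in head:
        raise ValueError(f"{pdf_path} does not appear to be a PDF file")
    return pdf_path
```

Calling check_pdf_path("input.pdf") before tabula.read_pdf() separates input problems from extraction problems, so the final except block only has to cover the latter.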
Conclusion
The tabula-py library provides a straightforward way to extract tables from PDF files for use in Excel. Use convert_into() for direct conversion to CSV, or combine read_pdf() with pandas to write .xlsx files and keep control over the output. Remember to have Java installed and to handle errors gracefully in production use.
