How to convert PDF files to Excel files using Python?

Python provides several libraries for handling file conversions. In this article, we'll learn how to convert PDF files to Excel format using the tabula-py module. tabula-py is a Python wrapper around the tabula-java library. It extracts tabular data from PDF documents into pandas DataFrames, which can then be written to CSV or Excel files.

Prerequisites

Before working with tabula-py, ensure you have Java installed on your system, as the library depends on Java for PDF processing.
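
One way to confirm this prerequisite from Python is to look for the java executable on the PATH. The following is a minimal sketch that uses only the standard library; the helper name java_available is hypothetical and is not part of tabula-py −

```python
import shutil
import subprocess


def java_available():
    """Return True if a `java` executable is on the PATH and runs."""
    java_path = shutil.which("java")
    if java_path is None:
        return False
    try:
        # `java -version` exits with status 0 when the runtime starts correctly
        result = subprocess.run(
            [java_path, "-version"],
            capture_output=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if java_available():
        print("Java found on PATH")
    else:
        print("Java not found; install a Java runtime and add it to PATH")
```

If the check fails, installing a Java runtime and reopening the terminal is usually the first thing to try.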

Installation

Install the required packages using pip. openpyxl is the engine pandas uses later in this article to write .xlsx files −

pip install tabula-py openpyxl

Basic Syntax

The main functions we'll use are −

  • tabula.read_pdf() − Reads a PDF and returns a list of DataFrames, one per detected table
  • tabula.convert_into() − Converts a PDF directly to a CSV, TSV, or JSON file

Method 1: Using read_pdf() and to_csv()

The first approach reads the PDF into a DataFrame and then exports it to CSV −

import tabula
import pandas as pd

# Read the PDF (replace with your PDF path); read_pdf() returns a list
# of DataFrames, so [0] selects the first table found
df = tabula.read_pdf("sample_table.pdf", pages='all')[0]

# Convert DataFrame to CSV
df.to_csv("output_file.csv", index=False)

# Display the DataFrame
print(df.head())
print(f"Data successfully converted to CSV with {len(df)} rows")

Method 2: Using convert_into() Function

The second approach writes the output file directly, without building a DataFrame in Python first, which can be convenient for large files −

import tabula

# Direct PDF to CSV conversion
tabula.convert_into(
    "sample_table.pdf",        # Input PDF file
    "converted_data.csv",      # Output CSV file
    output_format="csv",       # Output format
    pages='all'               # Convert all pages
)

print("PDF successfully converted to CSV format")

# Read the converted file to verify
import pandas as pd
df = pd.read_csv("converted_data.csv")
print(f"Converted file contains {len(df)} rows and {len(df.columns)} columns")

Converting to Excel Format

To create an actual Excel file (.xlsx), combine tabula with pandas −

import tabula
import pandas as pd

# Read PDF data
tables = tabula.read_pdf("sample_table.pdf", pages='all')

# Create Excel writer object
with pd.ExcelWriter("output_file.xlsx", engine='openpyxl') as writer:
    for i, table in enumerate(tables):
        # Write each table to a separate sheet
        sheet_name = f"Table_{i+1}"
        table.to_excel(writer, sheet_name=sheet_name, index=False)

print(f"Created Excel file with {len(tables)} sheets")

Handling Multiple Pages

For PDFs with tables across multiple pages −

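import tabula
import pandas as pd

# Read a single page; [0] selects the first table on it
df_page1 = tabula.read_pdf("document.pdf", pages=1)[0]

# Read several pages; the result is a list of DataFrames
df_multiple = tabula.read_pdf("document.pdf", pages=[1, 2, 3])

# Read all pages and combine the tables into one DataFrame
# (this assumes every table has the same columns)
all_tables = tabula.read_pdf("document.pdf", pages='all')
combined_df = pd.concat(all_tables, ignore_index=True)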

print(f"Combined data has {len(combined_df)} total rows")

Parameters and Options

Parameter         Description                            Example
pages             Specify which pages to extract         'all', 1, [1,2,3]
multiple_tables   Extract multiple tables per page       True/False
area              Define extraction area coordinates     [y1, x1, y2, x2]
output_format     Output file format                     "csv", "json"
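
As an illustration of how these parameters fit together, the sketch below collects them into keyword arguments for tabula.read_pdf(). The helper names build_read_options and read_tables are hypothetical, the coordinates are made-up values, and area is assumed to be given in PDF points in the order top, left, bottom, right −

```python
def build_read_options(pages="all", area=None, multiple_tables=True):
    """Build keyword arguments for tabula.read_pdf() (hypothetical helper)."""
    options = {"pages": pages, "multiple_tables": multiple_tables}
    if area is not None:
        top, left, bottom, right = area
        # The bottom/right edges must lie beyond the top/left edges
        if bottom <= top or right <= left:
            raise ValueError("area must be [top, left, bottom, right]")
        options["area"] = [top, left, bottom, right]
    return options


def read_tables(pdf_path, **kwargs):
    """Read tables from pdf_path; needs tabula-py and Java at call time."""
    import tabula  # imported lazily so the helper above works without it

    return tabula.read_pdf(pdf_path, **build_read_options(**kwargs))


# Example values only: pages 1 and 2, restricted to one region of each page
options = build_read_options(pages=[1, 2], area=(50, 20, 400, 560))
print(options)
```

Keeping the option building separate from the read_pdf() call makes it possible to validate the coordinates before Java is started.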

Error Handling

Always include error handling for robust PDF processing −

import tabula
import pandas as pd

try:
    # Attempt to read PDF
    tables = tabula.read_pdf("input.pdf", pages='all')
    
    if tables:
        # Save the first table to Excel
        tables[0].to_excel("output.xlsx", index=False)
        print("Conversion successful!")
    else:
        print("No tables found in the PDF")
        
except Exception as e:
    print(f"Error during conversion: {e}")
    print("Please check if the PDF contains tabular data and Java is installed")
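
Some failures can be caught before tabula-py is called at all. The sketch below, which uses only the standard library, checks that the input path exists and begins with the %PDF- signature that PDF files start with. The function name looks_like_pdf is hypothetical, and the check is deliberately conservative: it does not parse the file or confirm that it contains tables −

```python
from pathlib import Path


def looks_like_pdf(path):
    """Return True if path is an existing file that starts with b'%PDF-'."""
    pdf_path = Path(path)
    if not pdf_path.is_file():
        return False
    with pdf_path.open("rb") as handle:
        # Only the first five bytes are inspected; the content is not parsed
        return handle.read(5) == b"%PDF-"


if __name__ == "__main__":
    candidate = "input.pdf"  # same placeholder name as in the example above
    if looks_like_pdf(candidate):
        print(f"{candidate} looks like a PDF")
    else:
        print(f"{candidate} is missing or is not a PDF")
```

Running a check like this first lets the error message distinguish a bad input path from a Java or extraction problem.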

Conclusion

The tabula-py library provides a straightforward way to extract tables from PDF files and convert them to Excel format. Use convert_into() for direct conversion to CSV, or combine read_pdf() with pandas to write .xlsx files and gain more control over the output. Remember to have Java installed and handle errors gracefully for production use.

Updated on: 2026-03-25T19:30:04+05:30
