How to convert PDF files to Excel files using Python?


Python has a large set of libraries for handling different types of operations. Through this article, we will see how to convert a pdf file to an Excel file. There are various packages are available in Python to convert pdf to CSV but we will use the tabula-py module. The major part of tabula-py is written in Java that reads the pdf document and converts the Python DataFrame into a JSON object.

In order to work with tabula-py, we must have java preinstalled in our system. Now, to convert the pdf file to csv we will follow the steps-

  • First, install the required package by typing pip install tabula-py in the command shell.

  • Now read the file using read_pdf("file location", pages=number) function. This will return the DataFrame.

  • Convert the DataFrame into an Excel file using tabula.convert_into(‘pdf-filename’, ‘name_this_file.csv’,output_format= "csv", pages= "all"). It generally exports the pdf file into an excel file

Example

In this example, we have used IPL Match Schedule Document to convert it into an excel file.

# Import the required Module
import tabula
# Read a PDF File
df = tabula.read_pdf("IPLmatch.pdf", pages='all')[0]
# convert PDF into CSV
tabula.convert_into("IPLmatch.pdf", "iplmatch.csv", output_format="csv", pages='all')
print(df)

Output

Running the above code will convert the pdf file into an excel (csv) file.

Updated on: 26-Aug-2023

29K+ Views

Kickstart Your Career

Get certified by completing the course

Get Started
Advertisements