Python – Reading contents of PDF using OCR (Optical Character Recognition)


PDF stands for Portable Document Format and is one of the most popular file formats for exchanging documents between devices. Because a PDF file preserves its content and layout, it gives the user consistent readability and a stable format. Although the text in a PDF is easy to read, copying the contents out of it can be time-consuming. To make this extraction easier, OCR (Optical Character Recognition) is used.

Reading contents of PDF using OCR

In this article, we are going to deal with Optical Character Recognition, or OCR, a technology that helps convert scanned images or handwritten text into editable computer files. Python supports several third-party libraries that make use of OCR technology to read content from PDFs.

One such library is Pytesseract. It is a Python wrapper for Google's Tesseract-OCR engine rather than an OCR engine itself. Tesseract can recognize text in over 100 languages, including English, Hindi, Arabic and Chinese among others. Because it works on images, the pages of a PDF file are first converted to images before the text is extracted.
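As a rough illustration of the language support, the sketch below builds the lang argument that Tesseract expects (language codes joined with "+") and passes it to pytesseract.image_to_string. The helper names build_lang_arg and ocr_image are hypothetical, and the sketch assumes that the Tesseract binary and the language data for the requested codes (for example eng and hin) are installed.

```python
def build_lang_arg(languages):
    """Join Tesseract language codes into one string, e.g. "eng+hin"."""
    return "+".join(languages)


def ocr_image(image_path, languages=("eng",)):
    """Run OCR on one image file using the given language codes."""
    # Imported here so build_lang_arg works without the OCR stack installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=build_lang_arg(languages))
```

A call such as ocr_image("scan.png", languages=("eng", "hin")) would then read a page that mixes English and Hindi; the file name scan.png is only a placeholder.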

Optical Character Recognition:

OCR technology eliminates the manual reading of documents and saves time. Its applications are not limited to document extraction but also extend to handwriting recognition, ID card recognition and identity verification. Given these use cases, OCR is a tool every developer should get familiar with when dealing with images or PDF documents.

Python provides the flexibility needed to work with many available OCR libraries such as pytesseract, so a project can be scaled up to large datasets without requiring human interaction. Combined with machine learning techniques like natural language processing (NLP) and object detection, OCR can support much more than plain text extraction.
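To make the idea of scaling up without human interaction concrete, here is a minimal sketch that applies a PDF-processing function to every PDF file in a folder. The name process_folder and its process_pdf parameter are assumptions made for illustration; any function that takes a PDF path, such as an OCR routine, could be passed in.

```python
from pathlib import Path


def process_folder(folder, process_pdf):
    """Call process_pdf on every PDF in folder; return {file name: result or None}."""
    results = {}
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        try:
            results[pdf_path.name] = process_pdf(str(pdf_path))
        except Exception as error:
            # Keep going if one file fails, and record the failure as None.
            print(f"Skipping {pdf_path.name}: {error}")
            results[pdf_path.name] = None
    return results
```

Catching the error per file means one unreadable PDF does not stop the rest of the batch.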

Python Program to read the contents of PDF using OCR with try and except method

The input is a PDF file named sample.pdf. The program converts its pages to images, recognizes the text in each image using OCR, writes the text to a text file, and finally returns the name of that file. The conversion and recognition steps are wrapped in a try and except block so that any error is reported.

Algorithm

  • Step 1 − Import the required modules, os and pytesseract.

  • Step 2 − Import the Image module from the PIL package.

  • Step 3 − Import the convert_from_path function from pdf2image, which converts the pages of a PDF file to images.

  • Step 4 − Define the read_pdf function with one parameter, the input file name.

  • Step 5 − Initialize an empty list to hold the text of each page.

  • Step 6 − Inside the try block, convert each page of the PDF file into an image.

  • Step 7 − For each image in the images list, generate a file name and save the image in JPEG format.

  • Step 8 − Extract the text from each image using the pytesseract module and add it to the list.

  • Step 9 − If an exception occurs while performing the above steps, print it.

  • Step 10 − Generate an output file name by removing the extension from the input file name and appending the .txt extension.

  • Step 11 − Write the extracted text to the output file and return the output file name.

  • Step 12 − Define the pdf_file variable with the input file name.

  • Step 13 − Call the read_pdf function with pdf_file as input and print the file name it returns.
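The file naming in Step 10 can be tried on its own. The short sketch below uses os.path.splitext, which splits a file name into its base and its extension; the helper name output_name_for is hypothetical.

```python
import os


def output_name_for(pdf_name):
    """Return the .txt file name that corresponds to a PDF file name."""
    base, _extension = os.path.splitext(pdf_name)
    return base + ".txt"


print(output_name_for("sample.pdf"))  # prints sample.txt
```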

Example

# Importing the os module to build file names
import os
# Importing the pytesseract module to extract text from images
import pytesseract as tess
# Importing the Image module from the PIL package to open the saved page images
from PIL import Image
# Importing the convert_from_path function from the pdf2image module to convert PDF pages to images
from pdf2image import convert_from_path


# This function takes a PDF file name as input and returns the name of
# the text file that contains the extracted text.
def read_pdf(file_name):
    # Store the text of every page here
    pages = []

    try:
        # Convert the PDF file to a list of PIL images, one per page
        images = convert_from_path(file_name)

        for i, image in enumerate(images):
            # Generate a file name for each page image
            filename = "page_" + str(i) + "_" + os.path.basename(file_name) + ".jpeg"
            # Save the page image as JPEG
            image.save(filename, "JPEG")
            # Extract the text from the saved image using pytesseract
            text = tess.image_to_string(Image.open(filename))
            # Append the extracted text to the pages list
            pages.append(text)

    except Exception as e:
        # Report any error raised during conversion or recognition
        print(str(e))

    # Generate the output file name by replacing the extension with .txt
    output_file_name = os.path.splitext(file_name)[0] + ".txt"
    # Write the extracted text to the output file
    with open(output_file_name, "w", encoding="utf-8") as f:
        f.write("\n".join(pages))

    return output_file_name


# Print the name of the text file that holds the extracted text
pdf_file = "sample.pdf"
print(read_pdf(pdf_file))
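The example above saves every page as a JPEG file and still writes an output file when the conversion fails. As one possible variation, and not the only way to do it, the sketch below passes each page image straight to pytesseract (image_to_string accepts PIL images as well as file names) and returns None on failure instead of writing an empty file. The function name pdf_to_text_file and its to_images and ocr parameters are assumptions added for illustration; they default to pdf2image's convert_from_path and pytesseract's image_to_string.

```python
import os


def pdf_to_text_file(file_name, to_images=None, ocr=None):
    """OCR a PDF and write the text to a .txt file; return its name or None."""
    # Fall back to the real libraries when no replacements are supplied.
    if to_images is None:
        from pdf2image import convert_from_path as to_images
    if ocr is None:
        from pytesseract import image_to_string as ocr

    try:
        # Each page image goes directly to the OCR function; no JPEG is saved.
        pages = [ocr(image) for image in to_images(file_name)]
    except Exception as error:
        print(f"Could not read {file_name}: {error}")
        return None

    output_file_name = os.path.splitext(file_name)[0] + ".txt"
    with open(output_file_name, "w", encoding="utf-8") as f:
        f.write("\n".join(pages))
    return output_file_name
```

Calling pdf_to_text_file("sample.pdf") uses the real libraries, while passing simple stand-in functions for to_images and ocr makes the file handling easy to check without Tesseract installed.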

Conclusion

Handling data is a challenging task for organizations that work with a high volume of documents, and developments in data science and machine learning have made that data easier to access. PDF is the preferred format for sharing documents without changes, so this approach helps people convert such files into text files.

Updated on: 04-Sep-2023
