Python - Reading contents of PDF using OCR (Optical Character Recognition)
PDF stands for Portable Document Format and is one of the most popular file formats for exchanging documents between devices. A PDF preserves its text and layout, so the document looks the same wherever it is opened, which gives readers a stable, easy-to-read format. Although reading text in a PDF is easy, copying the contents out of it can be time-consuming. To make this extraction easier, an OCR (Optical Character Recognition) tool is used.
Reading contents of PDF using OCR
In this article, we deal with Optical Character Recognition (OCR), a technology that converts scanned images or handwritten text into editable computer files. Several third-party Python libraries use OCR technology to read content from PDFs.
One such library is Pytesseract, a Python wrapper around Google's Tesseract-OCR engine. Once the pages of a PDF have been converted to images, Pytesseract can recognize text in over 100 languages, including English, Hindi, Arabic and Chinese.
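As a minimal sketch of how Pytesseract is typically called, the snippet below runs OCR on a single image and selects the recognition languages. The helper names `build_lang_arg` and `ocr_image` and the file name `scan.png` are illustrative assumptions, not part of the original program. The language codes (such as `eng` and `hin`) only work if the matching Tesseract language data is installed.

```python
def build_lang_arg(languages):
    """Join Tesseract language codes into the 'eng+hin' form Tesseract accepts."""
    codes = [code.strip() for code in languages if code.strip()]
    if not codes:
        raise ValueError("at least one language code is required")
    return "+".join(codes)


def ocr_image(image_path, languages=("eng",)):
    """Return the text recognized in one image file (hypothetical helper)."""
    # Imported here so build_lang_arg works even without the OCR stack installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=build_lang_arg(languages))


# Example call (needs Tesseract and a real image; "scan.png" is a placeholder):
# print(ocr_image("scan.png", languages=("eng", "hin")))
```

Passing several codes joined with `+` asks Tesseract to consider all of those languages when reading the same image.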
Optical Character Recognition:
OCR technology reduces the need to read and retype documents manually, which saves time. Its applications are not limited to document extraction but also extend to handwriting recognition, ID card recognition and identity verification. Given these use cases, OCR is a tool that every developer who works with images or PDF documents should be familiar with.
Python makes it easy to work with OCR libraries such as pytesseract, so projects can be scaled up to large datasets without requiring human interaction. Combined with machine learning techniques such as natural language processing (NLP) and object detection, the extracted text can feed further automated analysis of documents.
Python Program to read the contents of PDF using OCR with try and except method
The input is a PDF file named sample.pdf. The program uses OCR to recognize the text in the PDF file, writes that text to a text file, and returns the name of the text file. A try and except block handles any errors raised during conversion or recognition.
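Both pytesseract and pdf2image call external command-line programs: the `tesseract` binary and Poppler's `pdftoppm`. The sketch below checks that they can be found on the system PATH before any OCR is attempted. The function name `missing_tools` is a hypothetical helper, and the check is only a rough guide, since both libraries can also be pointed at tools installed outside the PATH.

```python
import shutil


def missing_tools(required=("tesseract", "pdftoppm"), which=shutil.which):
    """Return the names of required command-line tools not found on PATH."""
    return [tool for tool in required if which(tool) is None]


missing = missing_tools()
if missing:
    print("Install these tools first:", ", ".join(missing))
else:
    print("Tesseract and Poppler were found on PATH.")
```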
Step 1 - Import the required modules, os and pytesseract.
Step 2 - Import the Image module from the PIL package.
Step 3 - Import the convert_from_path function from the pdf2image module, which converts the pages of a PDF file into images.
Step 4 - Define the read_pdf function with one parameter, the input file name.
Step 5 - Initialize an empty list to hold the text of each page.
Step 6 - Inside the try block, convert each page of the PDF file into an image.
Step 7 - For each image in the images list, generate a filename and save the image in JPEG format.
Step 8 - Extract the text from each saved image using the pytesseract module and append it to the list.
Step 9 - If any exception occurs while performing the above steps, print it.
Step 10 - Generate an output file name by removing the extension from the input filename and appending the .txt extension.
Step 11 - Write the extracted text to the output file and return the output file name.
Step 12 - Define the pdf_file variable with the input filename.
Step 13 - Call the read_pdf function with pdf_file as input and print the output file name it returns.
# Importing the os module to perform file operations
import os
# Importing the pytesseract module to extract text from images
import pytesseract as tess
# Importing the Image module from the PIL package to work with images
from PIL import Image
# Importing the convert_from_path function from the pdf2image module to convert PDF files to images
from pdf2image import convert_from_path

# This function takes a PDF file name as input and returns the name of the
# text file that contains the extracted text.
def read_pdf(file_name):
    # Store the text of all pages of one file here:
    pages = []
    try:
        # Convert the PDF file to a list of PIL images:
        images = convert_from_path(file_name)
        # Extract text from each image:
        for i, image in enumerate(images):
            # Generating filename for each image
            filename = "page_" + str(i) + "_" + os.path.basename(file_name) + ".jpeg"
            image.save(filename, "JPEG")  # Saving each image as JPEG
            text = tess.image_to_string(Image.open(filename))  # Extracting text from each image using pytesseract
            pages.append(text)  # Appending extracted text to pages list
    except Exception as e:
        print(str(e))
    # Write the extracted text to a file:
    output_file_name = os.path.splitext(file_name)[0] + ".txt"  # Generating output file name
    with open(output_file_name, "w") as f:
        f.write("\n".join(pages))  # Writing extracted text to output file
    return output_file_name

# Print the name of the text file that holds the extracted text
pdf_file = "sample.pdf"
print(read_pdf(pdf_file))
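The program above saves every page as a JPEG file before reading it back. As an alternative sketch, not the article's method, the page images returned by `convert_from_path` can be passed straight to `image_to_string`, which avoids the intermediate files. The helper names `output_name_for`, `ocr_pages` and `read_pdf_in_memory` are hypothetical, and `dpi=300` is an assumed setting for sharper page images, not a value from the original program.

```python
import os


def output_name_for(pdf_path):
    """Swap the input file's extension for .txt (sample.pdf -> sample.txt)."""
    return os.path.splitext(pdf_path)[0] + ".txt"


def ocr_pages(images, ocr):
    """Apply the `ocr` callable to each page image and collect the text."""
    return [ocr(image) for image in images]


def read_pdf_in_memory(pdf_path, dpi=300):
    """OCR a PDF without writing intermediate JPEG files (hypothetical helper)."""
    # Imported lazily so the helpers above can be used without the OCR stack.
    import pytesseract
    from pdf2image import convert_from_path

    images = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
    pages = ocr_pages(images, pytesseract.image_to_string)

    output_path = output_name_for(pdf_path)
    with open(output_path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(pages))
    return output_path


# Example call (needs Tesseract, Poppler and a real PDF):
# print(read_pdf_in_memory("sample.pdf"))
```

This version lets conversion and OCR errors propagate to the caller instead of printing them, so a failed run does not leave behind an empty text file.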
In the 21st century, handling data is a challenging task for organizations with a high volume of data, and developments in data science and machine learning have made that data easier to access. PDF is a preferred format for transmitting documents without changes, so this approach helps people convert PDF files into text files.