What are the modules available in Python for converting PDF to text?


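You can use the PDFMiner package to convert PDF to text. On Python 3, install its maintained fork, pdfminer.six (pip install pdfminer.six), which keeps the same pdfminer module name.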

Example

You can use it in the following way: 

# Python 3, using the pdfminer.six fork (pip install pdfminer.six)
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def pdfparser(path):
    resource_manager = PDFResourceManager()
    retstr = StringIO()  # in-memory buffer that receives the extracted text
    laparams = LAParams()
    device = TextConverter(resource_manager, retstr, laparams=laparams)
    interpreter = PDFPageInterpreter(resource_manager, device)

    # Process each page contained in the document.
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
    device.close()

    # Read the accumulated text once, after every page has been processed.
    text = retstr.getvalue()
    print(text)


pdfparser('filename.pdf')

This takes a PDF file and extracts its text page by page, using the process_page method of the PDFPageInterpreter class.
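If you need the text of each page separately rather than in one buffer, the following is a minimal sketch. It assumes the pdfminer.six fork, whose pdfminer.high_level module provides extract_pages. The helper names text_of_page and extract_page_texts are made up for this illustration, and the duck-typed get_text check is a simplification of an isinstance test against pdfminer.layout.LTTextContainer.

```python
def text_of_page(layout):
    """Join the text of every element on one page that exposes get_text().

    Elements without get_text() (lines, rectangles, images) are skipped.
    """
    return "".join(el.get_text() for el in layout if hasattr(el, "get_text"))


def extract_page_texts(path):
    """Return a list holding one string of text per page of the PDF at path."""
    # Imported lazily so text_of_page can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_pages

    return [text_of_page(layout) for layout in extract_pages(path)]
```

Calling extract_page_texts('filename.pdf') should then give a list that you can index by page number, starting at 0.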

There is an alternative to PDFMiner with a much simpler API for extracting text: pypdf, the maintained successor to pyPdf and PyPDF2. It works fine, assuming that you're working with well-formed PDFs. If all you want is the text (with spaces), you can do the following:

# Python 3 (pip install pypdf)
from pypdf import PdfReader

reader = PdfReader('filename.pdf')
for page in reader.pages:
    print(page.extract_text())
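
Since the simpler library is only reliable on well-formed PDFs, one option is to try it first and fall back to PDFMiner when it fails or returns nothing. The sketch below is one possible arrangement, not a recommended standard: it assumes the pypdf and pdfminer.six packages are installed, and the function names extract_with_pypdf, extract_with_pdfminer and extract_text_with_fallback are hypothetical.

```python
def extract_with_pypdf(path):
    """Extract text with pypdf (assumed installed; imported lazily)."""
    from pypdf import PdfReader

    return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)


def extract_with_pdfminer(path):
    """Extract text with pdfminer.six (assumed installed; imported lazily)."""
    from pdfminer.high_level import extract_text

    return extract_text(path)


def extract_text_with_fallback(path, extractors=(extract_with_pypdf, extract_with_pdfminer)):
    """Try each extractor in order and return the first non-blank result."""
    errors = []
    for extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:  # broad on purpose: parsers raise many error types
            errors.append(f"{extractor.__name__}: {exc}")
            continue
        if text and text.strip():
            return text
    raise ValueError(f"no text extracted from {path!r}; errors: {errors}")
```

The extractors are passed in as plain functions, so you can reorder them or add another library without changing the fallback logic.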

Updated on: 11-Dec-2019
