What are the modules available in Python for converting PDF to text?
You can use the PDFMiner package to convert PDF to text.
Example
You can use it in the following way:
# Python 3, using the pdfminer.six fork of PDFMiner (pip install pdfminer.six).
from io import StringIO

from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.converter import TextConverter


def pdfparser(path):
    resource_manager = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    # The converter writes the extracted text into the in-memory buffer.
    device = TextConverter(resource_manager, retstr, laparams=laparams)
    interpreter = PDFPageInterpreter(resource_manager, device)
    with open(path, 'rb') as fp:
        # Process each page contained in the document.
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
    device.close()
    data = retstr.getvalue()
    print(data)
    return data


pdfparser('filename.pdf')
This takes a PDF file and extracts its text page by page using the process_page method of the PDFPageInterpreter class.
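Because the text is extracted page by page, it can be handy to get the pages back as separate strings. The following is a minimal sketch, assuming the converter's output ends each page with a form feed character ("\f"), which is how PDFMiner's text converter is understood to mark page boundaries. The helper name split_pages is hypothetical and not part of PDFMiner.

```python
def split_pages(text):
    """Split extracted text into a list of per-page strings."""
    # Assumption: each page's text is terminated by a form feed ("\f").
    pages = text.split("\f")
    # A trailing form feed leaves one empty string at the end; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


# Stand-in for the string returned by retstr.getvalue().
sample = "First page\n\fSecond page\n\f"
for number, page_text in enumerate(split_pages(sample), start=1):
    print(number, repr(page_text))
```

If the assumption about the form feed does not hold for your version, an alternative is to read and reset the buffer after each process_page call.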
A simpler alternative to PDFMiner is pyPdf, which is maintained today under the name pypdf. Its API for extracting text is much easier to use, and it works fine as long as you're working with well-formed PDFs. If all you want is the text (with spaces), you can do the following:
# Python 3; pypdf is the maintained successor of pyPdf (pip install pypdf).
from pypdf import PdfReader

reader = PdfReader('filename.pdf')
for page in reader.pages:
    print(page.extract_text())
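Since the simpler library is only reliable on well-formed PDFs, one option is to try it first and fall back to PDFMiner when it fails or returns nothing. The sketch below is one possible way to do that, not a recommended standard. It assumes the pypdf package (the successor of pyPdf) and pdfminer.six are installed; the function names extract_with_pypdf, extract_with_pdfminer, and extract_text_with_fallback are hypothetical helpers introduced here for illustration.

```python
def extract_with_pypdf(path):
    # Assumes the pypdf package is installed.
    from pypdf import PdfReader
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_with_pdfminer(path):
    # Assumes the pdfminer.six package is installed.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def extract_text_with_fallback(
    path, extractors=(extract_with_pypdf, extract_with_pdfminer)
):
    """Return text from the first extractor that yields non-blank output."""
    errors = []
    for extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:  # e.g. malformed PDF or missing package
            errors.append((extractor.__name__, exc))
            continue
        if text.strip():
            return text
    raise RuntimeError(f"no extractor produced text for {path!r}: {errors}")
```

Calling extract_text_with_fallback('filename.pdf') would then return the text from whichever library succeeds first. Passing a different extractors tuple changes the order or swaps in another backend.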