What are the modules available in Python for converting PDF to text?
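You can use the PDFMiner package to convert PDF to text. On Python 3, install its maintained fork, pdfminer.six, which is imported under the same pdfminer name.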
Example
You can use it in the following way:
# Written for Python 3 with pdfminer.six (the maintained fork of PDFMiner).
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def pdfparser(path):
    resource_manager = PDFResourceManager()
    retstr = StringIO()  # in-memory buffer that receives the extracted text
    laparams = LAParams()  # default layout-analysis parameters
    device = TextConverter(resource_manager, retstr, laparams=laparams)
    interpreter = PDFPageInterpreter(resource_manager, device)
    with open(path, 'rb') as fp:
        # Process each page contained in the document.
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
    device.close()
    text = retstr.getvalue()
    print(text)


pdfparser('filename.pdf')
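The function takes the path to a PDF file and extracts its text page by page using the process_page method of the PDFPageInterpreter class. The TextConverter device writes the extracted text into the StringIO buffer, which is read back once all pages have been processed.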
A simpler alternative to PDFMiner for extracting text is pyPdf, which works well provided you are working with well-formed PDFs. pyPdf itself is no longer maintained; its successor is published as pypdf and runs on Python 3. If all you want is the text (with spaces), you can do the following:
from pypdf import PdfReader

# pypdf is the maintained successor to pyPdf.
reader = PdfReader('filename.pdf')
for page in reader.pages:
    print(page.extract_text())
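Since either library can do the job, a script may want to use whichever one happens to be installed. The following is a minimal sketch of that idea, not an official recipe. It assumes Python 3, pdfminer.six (imported as pdfminer) and pypdf (the maintained successor to pyPdf). The helper names available_pdf_backends and pdf_to_text are hypothetical and chosen only for illustration.

```python
import importlib.util


def available_pdf_backends():
    """Return the names of the supported PDF libraries that can be imported."""
    candidates = ("pdfminer", "pypdf")
    return [name for name in candidates
            if importlib.util.find_spec(name) is not None]


def pdf_to_text(path):
    """Extract all text from the PDF at `path` with the first available backend."""
    backends = available_pdf_backends()
    if "pdfminer" in backends:
        # High-level helper shipped with pdfminer.six (assumed to be the
        # installed pdfminer distribution).
        from pdfminer.high_level import extract_text
        return extract_text(path)
    if "pypdf" in backends:
        from pypdf import PdfReader
        reader = PdfReader(path)
        # Fall back to an empty string for pages that yield no text.
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    raise RuntimeError("Install pdfminer.six or pypdf to extract PDF text")


if __name__ == "__main__":
    # Show which of the two libraries this environment provides.
    print(available_pdf_backends())
```

Calling pdf_to_text('filename.pdf') then returns the document text as a single string. The imports are deferred into the function so that the script still loads when only one of the two libraries is present.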