Python module for converting PDF to text [closed]

Posted on

Solving problem is about exposing yourself to as many situations as possible like Python module for converting PDF to text [closed] and practice these strategies over and over. With time, it becomes second nature and a natural way you approach any problems in general. Big or small, always start with a plan, use other strategies mentioned here till you are confident and ready to code the solution.
In this post, my aim is to share an overview the topic about Python module for converting PDF to text [closed], which can be followed any time. Take easy to follow this discuss.

Python module for converting PDF to text [closed]

Is there any python module to convert PDF files into text? I tried one piece of code found in Activestate which uses pypdf but the text generated had no space between and was of no use.

Asked By: cnu

||

Answer #1:

Try PDFMiner. It can extract text from PDF files as HTML, SGML or “Tagged PDF” format.

The Tagged PDF format seems to be the cleanest, and stripping out the XML tags leaves just the bare text.

A Python 3 version is available under:

Answered By: David Crow

Answer #2:

The PDFMiner package has changed since codeape posted.

EDIT (again):

PDFMiner has been updated again in version 20100213

You can check the version you have installed with the following:

>>> import pdfminer
>>> pdfminer.__version__
'20100213'

Here’s the updated version (with comments on what I changed/added):

def pdf_to_csv(filename):
    from cStringIO import StringIO  #<-- added so you can copy/paste this to try it
    from pdfminer.converter import LTTextItem, TextConverter
    from pdfminer.pdfparser import PDFDocument, PDFParser
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    class CsvConverter(TextConverter):
        def __init__(self, *args, **kwargs):
            TextConverter.__init__(self, *args, **kwargs)
        def end_page(self, i):
            from collections import defaultdict
            lines = defaultdict(lambda : {})
            for child in self.cur_item.objs:
                if isinstance(child, LTTextItem):
                    (_,_,x,y) = child.bbox                   #<-- changed
                    line = lines[int(-y)]
                    line[x] = child.text.encode(self.codec)  #<-- changed
            for y in sorted(lines.keys()):
                line = lines[y]
                self.outfp.write(";".join(line[x] for x in sorted(line.keys())))
                self.outfp.write("n")
    # ... the following part of the code is a remix of the 
    # convert() function in the pdfminer/tools/pdf2text module
    rsrc = PDFResourceManager()
    outfp = StringIO()
    device = CsvConverter(rsrc, outfp, codec="utf-8")  #<-- changed 
        # becuase my test documents are utf-8 (note: utf-8 is the default codec)
    doc = PDFDocument()
    fp = open(filename, 'rb')
    parser = PDFParser(fp)       #<-- changed
    parser.set_document(doc)     #<-- added
    doc.set_parser(parser)       #<-- added
    doc.initialize('')
    interpreter = PDFPageInterpreter(rsrc, device)
    for i, page in enumerate(doc.get_pages()):
        outfp.write("START PAGE %dn" % i)
        interpreter.process_page(page)
        outfp.write("END PAGE %dn" % i)
    device.close()
    fp.close()
    return outfp.getvalue()

Edit (yet again):

Here is an update for the latest version in pypi, 20100619p1. In short I replaced LTTextItem with LTChar and passed an instance of LAParams to the CsvConverter constructor.

def pdf_to_csv(filename):
    from cStringIO import StringIO
    from pdfminer.converter import LTChar, TextConverter    #<-- changed
    from pdfminer.layout import LAParams
    from pdfminer.pdfparser import PDFDocument, PDFParser
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    class CsvConverter(TextConverter):
        def __init__(self, *args, **kwargs):
            TextConverter.__init__(self, *args, **kwargs)
        def end_page(self, i):
            from collections import defaultdict
            lines = defaultdict(lambda : {})
            for child in self.cur_item.objs:
                if isinstance(child, LTChar):               #<-- changed
                    (_,_,x,y) = child.bbox
                    line = lines[int(-y)]
                    line[x] = child.text.encode(self.codec)
            for y in sorted(lines.keys()):
                line = lines[y]
                self.outfp.write(";".join(line[x] for x in sorted(line.keys())))
                self.outfp.write("n")
    # ... the following part of the code is a remix of the 
    # convert() function in the pdfminer/tools/pdf2text module
    rsrc = PDFResourceManager()
    outfp = StringIO()
    device = CsvConverter(rsrc, outfp, codec="utf-8", laparams=LAParams())  #<-- changed
        # becuase my test documents are utf-8 (note: utf-8 is the default codec)
    doc = PDFDocument()
    fp = open(filename, 'rb')
    parser = PDFParser(fp)
    parser.set_document(doc)
    doc.set_parser(parser)
    doc.initialize('')
    interpreter = PDFPageInterpreter(rsrc, device)
    for i, page in enumerate(doc.get_pages()):
        outfp.write("START PAGE %dn" % i)
        if page is not None:
            interpreter.process_page(page)
        outfp.write("END PAGE %dn" % i)
    device.close()
    fp.close()
    return outfp.getvalue()

EDIT (one more time):

Updated for version 20110515 (thanks to Oeufcoque Penteano!):

def pdf_to_csv(filename):
    from cStringIO import StringIO
    from pdfminer.converter import LTChar, TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfparser import PDFDocument, PDFParser
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    class CsvConverter(TextConverter):
        def __init__(self, *args, **kwargs):
            TextConverter.__init__(self, *args, **kwargs)
        def end_page(self, i):
            from collections import defaultdict
            lines = defaultdict(lambda : {})
            for child in self.cur_item._objs:                #<-- changed
                if isinstance(child, LTChar):
                    (_,_,x,y) = child.bbox
                    line = lines[int(-y)]
                    line[x] = child._text.encode(self.codec) #<-- changed
            for y in sorted(lines.keys()):
                line = lines[y]
                self.outfp.write(";".join(line[x] for x in sorted(line.keys())))
                self.outfp.write("n")
    # ... the following part of the code is a remix of the 
    # convert() function in the pdfminer/tools/pdf2text module
    rsrc = PDFResourceManager()
    outfp = StringIO()
    device = CsvConverter(rsrc, outfp, codec="utf-8", laparams=LAParams())
        # becuase my test documents are utf-8 (note: utf-8 is the default codec)
    doc = PDFDocument()
    fp = open(filename, 'rb')
    parser = PDFParser(fp)
    parser.set_document(doc)
    doc.set_parser(parser)
    doc.initialize('')
    interpreter = PDFPageInterpreter(rsrc, device)
    for i, page in enumerate(doc.get_pages()):
        outfp.write("START PAGE %dn" % i)
        if page is not None:
            interpreter.process_page(page)
        outfp.write("END PAGE %dn" % i)
    device.close()
    fp.close()
    return outfp.getvalue()
Answered By: tgray

Answer #3:

Since none for these solutions support the latest version of PDFMiner I wrote a simple solution that will return text of a pdf using PDFMiner. This will work for those who are getting import errors with process_pdf

import sys
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.layout import LAParams
from cStringIO import StringIO
def pdfparser(data):
    fp = file(data, 'rb')
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    # Create a PDF interpreter object.
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Process each page contained in the document.
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
        data =  retstr.getvalue()
    print data
if __name__ == '__main__':
    pdfparser(sys.argv[1])

See below code that works for Python 3:

import sys
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.layout import LAParams
import io
def pdfparser(data):
    fp = open(data, 'rb')
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    # Create a PDF interpreter object.
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Process each page contained in the document.
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
        data =  retstr.getvalue()
    print(data)
if __name__ == '__main__':
    pdfparser(sys.argv[1])
Answered By: user3272884

Answer #4:

Pdftotext An open source program (part of Xpdf) which you could call from python (not what you asked for but might be useful). I’ve used it with no problems. I think google use it in google desktop.

Answered By: Jamie

Answer #5:

pyPDF works fine (assuming that you’re working with well-formed PDFs). If all you want is the text (with spaces), you can just do:

import pyPdf
pdf = pyPdf.PdfFileReader(open(filename, "rb"))
for page in pdf.pages:
    print page.extractText()

You can also easily get access to the metadata, image data, and so forth.

A comment in the extractText code notes:

Locate all text drawing commands, in
the order they are provided in the
content stream, and extract the text.
This works well for some PDF files,
but poorly for others, depending on
the generator used. This will be
refined in the future. Do not rely on
the order of text coming out of this
function, as it will change if this
function is made more sophisticated.

Whether or not this is a problem depends on what you’re doing with the text (e.g. if the order doesn’t matter, it’s fine, or if the generator adds text to the stream in the order it will be displayed, it’s fine). I have pyPdf extraction code in daily use, without any problems.

Answered By: Tony Meyer

Answer #6:

You can also quite easily use pdfminer as a library. You have access to the pdf’s content model, and can create your own text extraction. I did this to convert pdf contents to semi-colon separated text, using the code below.

The function simply sorts the TextItem content objects according to their y and x coordinates, and outputs items with the same y coordinate as one text line, separating the objects on the same line with ‘;’ characters.

Using this approach, I was able to extract text from a pdf that no other tool was able to extract content suitable for further parsing from. Other tools I tried include pdftotext, ps2ascii and the online tool pdftextonline.com.

pdfminer is an invaluable tool for pdf-scraping.


def pdf_to_csv(filename):
    from pdflib.page import TextItem, TextConverter
    from pdflib.pdfparser import PDFDocument, PDFParser
    from pdflib.pdfinterp import PDFResourceManager, PDFPageInterpreter
    class CsvConverter(TextConverter):
        def __init__(self, *args, **kwargs):
            TextConverter.__init__(self, *args, **kwargs)
        def end_page(self, i):
            from collections import defaultdict
            lines = defaultdict(lambda : {})
            for child in self.cur_item.objs:
                if isinstance(child, TextItem):
                    (_,_,x,y) = child.bbox
                    line = lines[int(-y)]
                    line[x] = child.text
            for y in sorted(lines.keys()):
                line = lines[y]
                self.outfp.write(";".join(line[x] for x in sorted(line.keys())))
                self.outfp.write("n")
    # ... the following part of the code is a remix of the 
    # convert() function in the pdfminer/tools/pdf2text module
    rsrc = PDFResourceManager()
    outfp = StringIO()
    device = CsvConverter(rsrc, outfp, "ascii")
    doc = PDFDocument()
    fp = open(filename, 'rb')
    parser = PDFParser(doc, fp)
    doc.initialize('')
    interpreter = PDFPageInterpreter(rsrc, device)
    for i, page in enumerate(doc.get_pages()):
        outfp.write("START PAGE %dn" % i)
        interpreter.process_page(page)
        outfp.write("END PAGE %dn" % i)
    device.close()
    fp.close()
    return outfp.getvalue()

UPDATE:

The code above is written against an old version of the API, see my comment below.

Answered By: codeape

Answer #7:

slate is a project that makes it very simple to use PDFMiner from a library:

>>> with open('example.pdf') as f:
...    doc = slate.PDF(f)
...
>>> doc
[..., ..., ...]
>>> doc[1]
'Text from page 2...'
Answered By: Tim McNamara

Answer #8:

I needed to convert a specific PDF to plain text within a python module. I used PDFMiner 20110515, after reading through their pdf2txt.py tool I wrote this simple snippet:

from cStringIO import StringIO
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
def to_txt(pdf_path):
    input_ = file(pdf_path, 'rb')
    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    process_pdf(manager, converter, input_)
    return output.getvalue()
Answered By: gonz

Leave a Reply

Your email address will not be published. Required fields are marked *