python, pdf, pypdf

Problems with PyPDF ignoring some data


Hoping for some help, as I can't find a solution.

We currently have a lot of manual data entry done by people reading PDF files, and I have been asked to find a way to cut this time down. My solution would be to convert the PDF to a more easily readable format, then use grep to strip out the standard fields (leaving just the data behind). This would then be uploaded into a template, then into SAP.

However, the main problem has come at the first hurdle: converting the PDF into a txt file. The code I use is as follows -

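import pyPdf  # legacy Python 2 library

def getPDFContent(path):
    # Concatenate the extracted text of every page.
    content = ""
    with open(path, "rb") as pdf_file:
        pdf = pyPdf.PdfFileReader(pdf_file)
        for i in range(pdf.getNumPages()):
            content += pdf.getPage(i).extractText() + "\n"
    # Replace non-breaking spaces and collapse all runs of whitespace.
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

# Write the text out, dropping any characters that are not ASCII.
with open('test.txt', 'w') as f:
    f.write(getPDFContent("Adminform.pdf").encode("ascii", "ignore"))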

This works, but it ignores some of the data in the PDF files. To show you what I mean, take this PDF page -

http://s23.postimg.org/6dqykomqj/error.png

The first section (gender, title, name) produces the output below -

*Title: *Legal First Name (s): *Your forename and second name (if applicable) as it appears on your passport or birth certificate. Address: *Legal Surname: *Your surname as it appears on your passport or birth certificate

Basically, the actual data that I want to capture is not being converted.

Anyone have a fix for this?

Thanks,


Solution

  • Generally speaking, converting PDFs to text is a bad idea; the result is almost always messy. There are Linux utilities that do what you have implemented, but I don't expect them to do any better. I can suggest Tabula, which you can find at:

    http://tabula.technology/

    It is meant for extracting tables out of PDFs by manually marking the boundaries of each table. Running it on a PDF with no tables would still output the text, with some of the formatting retained.

    There is some automation, although it is limited. See:

    https://github.com/tabulapdf/tabula-extractor/wiki/Using-the-command-line-tabula-extractor-tool

    Also, though it may not be entirely relevant here, you can use OpenRefine to manage messy data. See:

    http://openrefine.org/
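
Whichever tool produces the text, the "grep" step from the question (dropping the fixed form labels so that only the entered data is left) could be sketched in Python roughly as below. This is only a minimal sketch under assumptions: the `KNOWN_LABELS` list, the `strip_labels` helper and the sample string are hypothetical and made up for illustration, and it only helps if the extracted text actually contains the entered values, which the pyPdf output in the question does not.

```python
import re

# Hypothetical list of fixed labels, copied from the sample output in the question.
KNOWN_LABELS = [
    "*Legal First Name (s):",
    "*Legal Surname:",
    "*Title:",
    "Address:",
]

def strip_labels(text, labels=KNOWN_LABELS):
    """Split text at each known label and return the non-empty pieces in between."""
    # Try longer labels first so a shorter label never matches inside a longer one.
    ordered = sorted(labels, key=len, reverse=True)
    pattern = "|".join(re.escape(label) for label in ordered)
    pieces = re.split(pattern, text)
    return [piece.strip() for piece in pieces if piece.strip()]

# Made-up sample line: labels from the form with invented values after them.
sample = ("*Title: Mr *Legal First Name (s): John Paul "
          "Address: 1 High Street *Legal Surname: Smith")
values = strip_labels(sample)
print(values)
```

The labels are escaped with `re.escape` because they contain characters such as `*` and `(` that would otherwise be read as regular expression syntax. The help text printed under each field on the real form would need to be added to the label list in the same way.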