Tags: python, pdf, datanitro

Mining PDF data with Python through the clipboard - Python scripting the OS


I have written a script that extracts data from PDFs. I am using the win32clipboard module to copy the data into Python, and I have the logic working to get the data I need from each file.

The shortcoming of my process is that I have to open each PDF, press Ctrl-A to select all, then Ctrl-C to copy it to the clipboard. I then run my script. For reference, it runs within Excel using DataNitro.

I have tried PDFMiner, but it seems it is not being maintained, and it tends to break the text into small bits. The PDFs that I am mining contain lots of "small" tables. Copying from the clipboard seems to do a pretty decent job of keeping related things together.

Any suggestions on how I can script opening a PDF, selecting all, and copying? Basically I am looking for a Python way to script the OS. My gut feeling is that this is not possible, but maybe somebody knows.


Solution

  • I have settled on using pyPdf. It has a simple method that extracts the text from the PDF. I have written simple functions to find the relevant information I need in this text, splitting the text into a list for easy data identification.

    I have also written a loop to pick up the relevant files using a glob search and feed them into the parser.

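    import pyPdf

    # filename is the path to one PDF, e.g. supplied by the glob loop
    with open(filename, "rb") as f:
        pdf = pyPdf.PdfFileReader(f)
        # Concatenate the text of every page
        data = ''.join(page.extractText() for page in pdf.pages)
    # Split into lines for easier data identification
    data2 = data.split('\n')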
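
The glob loop and the lookup functions mentioned above are not shown, so here is a minimal sketch of how they might look. The names `parse_folder` and `find_after` are hypothetical, as is the assumption that the value of interest sits on the line right after its label. The text extractor is passed in as a callable, so the pyPdf snippet above (wrapped in a function that takes a path) could be plugged in without changing the loop.

```python
import glob
import os


def find_after(lines, label):
    """Return the line following the first line that contains label, or None."""
    for i, line in enumerate(lines[:-1]):
        if label in line:
            return lines[i + 1].strip()
    return None


def parse_folder(folder, extract_text):
    """Map each PDF path in folder to its extracted text, split into lines.

    extract_text is any callable that takes a file path and returns a string,
    for example a wrapper around the pyPdf page loop.
    """
    results = {}
    for path in sorted(glob.glob(os.path.join(folder, "*.pdf"))):
        results[path] = extract_text(path).split("\n")
    return results
```

Keeping the extractor separate from the loop means the file discovery and the lookup logic can be tried out on plain strings before pointing them at real PDFs.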