Tags: python, pdf, extract, pdfminer

Read a PDF file horizontally with pdfminer


I would like to extract the text of a PDF with pdfminer (version 20140328).

This is the code I use to extract the text:

from cStringIO import StringIO
import urllib2

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

def pdf_to_string(data):
    fp = StringIO(data)
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    # Create a PDF interpreter object.
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    # Process each page contained in the document.
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)

    # Read the text once, after every page has been processed.
    text = retstr.getvalue()
    device.close()
    return text

pdf_url = "http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/140836.pdf"
pdf_data = urllib2.urlopen(pdf_url).read()
text = pdf_to_string(pdf_data)

This is a screenshot of the PDF: [image]

The problem is that pdfminer doesn't read the page horizontally (a person, then that person's position) but column by column (all the persons first, then all their positions):

Belgium: 
Mr Koen GEENS 

Bulgaria: 
Mr Petar CHOBANOV 

Czech Republic: 
Mr Radek URBAN 


Minister for Finance, with responsibility for the Civil 
Service 

Minister for Finance 

Deputy Minister for Finance 

How can I make pdfminer read the text horizontally?


Solution

  • I have found a working solution with pdftotext:

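    import subprocess
    import tempfile

    def pdf_to_string(file_object):
        pdf_data = file_object.read()
        if not pdf_data:
            return None

        # pdftotext works on file paths, so stage input and output in temp files.
        with tempfile.NamedTemporaryFile(suffix=".pdf") as tf, \
                tempfile.NamedTemporaryFile(suffix=".txt") as output_tf:
            tf.write(pdf_data)
            tf.flush()  # make sure pdftotext sees the complete file
            # -layout keeps the physical layout, so each name stays on the
            # same line as its position. check_call raises if pdftotext fails.
            subprocess.check_call(["pdftotext", "-layout", tf.name, output_tf.name])
            return output_tf.read()

    pdf_file = "files/2014_1.pdf"
    with open(pdf_file, "rb") as file_object:
        print(pdf_to_string(file_object))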
    

    This produces the right output: each person's name, then their position :).

    Solved!
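
For anyone who would rather not shell out to pdftotext, "reading horizontally" can also be approximated in plain Python: collect each text line together with its position on the page, then merge the lines that sit at the same height. The sketch below only illustrates that idea and is not how pdfminer or pdftotext work internally. The input format (a list of `(x0, y0, text)` tuples), the function name `group_into_rows`, the `tolerance` parameter, and the coordinates in the sample data are all assumptions made up for this example; with pdfminer, such tuples would have to be gathered from the bounding boxes of its layout objects.

```python
def group_into_rows(fragments, tolerance=2.0):
    """Group (x0, y0, text) fragments into rows that share a vertical position.

    `fragments` is a hypothetical input format: one tuple per text line,
    in PDF coordinates (origin at the bottom-left corner of the page).
    """
    rows = []
    # Visit fragments top to bottom (larger y first), then left to right.
    for x0, y0, text in sorted(fragments, key=lambda f: (-f[1], f[0])):
        if rows and abs(rows[-1]["y"] - y0) <= tolerance:
            # Close enough to the current row's height: same visual line.
            rows[-1]["cells"].append((x0, text))
        else:
            rows.append({"y": y0, "cells": [(x0, text)]})

    lines = []
    for row in rows:
        # Within a row, order the cells from left to right.
        cells = sorted(row["cells"])
        lines.append("  ".join(text.strip() for _, text in cells))
    return lines


# Made-up coordinates that mimic the two-column layout from the question.
fragments = [
    (300, 688, "Minister for Finance, with responsibility for the Civil Service"),
    (300, 652, "Minister for Finance"),
    (50, 700, "Belgium:"),
    (50, 688, "Mr Koen GEENS"),
    (50, 664, "Bulgaria:"),
    (50, 652, "Mr Petar CHOBANOV"),
]

for line in group_into_rows(fragments):
    print(line)
```

The `tolerance` value absorbs small differences in baseline between the two columns; it would need tuning for a real document, and text that wraps over several lines inside one column (like the Belgian minister's title in the original PDF) needs extra handling that this sketch leaves out.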