python pdf pdfminer pypdf

Python: read PDF text straight across, as it looks in the PDF


Using the code from the answer to "Extracting text from a PDF file using PDFMiner in python?", I can extract the text from this PDF: https://www.tencent.com/en-us/articles/15000691526464720.pdf

However, under "CONSOLIDATED INCOME STATEMENT" the text is read down the columns: first the labels (Revenues, VAS, Online advertising), and only later the numbers. I want it read across each row, i.e.:

Revenues 73,528 49,552 73,528 66,392 VAS 46,877 35,108 etc.

Is there a way to do this?

I am also open to solutions other than pdfminer.

If I try PyPDF2 with the code below, not all of the text even shows up:

# importing required modules
import PyPDF2

# path to the downloaded PDF
file = "15000691526464720.pdf"

# opening the pdf in binary mode; the with-block closes it when done
with open(file, 'rb') as pdfFileObj:
    # creating a pdf reader object (PyPDF2 1.x/2.x API; 3.x renamed it PdfReader)
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    # number of pages in the pdf file
    a = pdfReader.numPages

    # printing the extracted text of each page
    for i in range(a):
        pageObj = pdfReader.getPage(i)
        print(pageObj.extractText())

Solution

  • PDFMiner can do this, and in my experience it works better than the other open-source Python tools.

    The key is to set the laparams parameter explicitly rather than leave it at its default values. This parameter gives PDFMiner more information about the layout of the page. Because the text here sits in tables with wide gaps between columns, we need to tell PDFMiner to use a large character margin (char_margin).

    The layout parameters are defined by the LAParams class in pdfminer.layout. Experiment with them to find the values that give the best results for this particular document.

    Here is sample code for the PDF in question. I extract only a single page for demonstration:

    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage
    from io import StringIO
    
    def convert_pdf_to_txt(path, pages):
        """Return the text of the given zero-based page numbers of the PDF at path."""
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        codec = 'utf-8'

        # A very large char_margin makes PDFMiner treat all characters on the
        # same horizontal line as one text line, so table rows are read across.
        laparams = LAParams(all_texts=True, detect_vertical=True,
                            line_overlap=0.5, char_margin=1000.0,
                            line_margin=0.5, word_margin=2,
                            boxes_flow=1)
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        maxpages = 0  # 0 means no limit on the number of pages
        caching = True
        pagenos = set(pages)  # zero-based page numbers

        # the with-block closes the file even if a page fails to parse
        with open(path, 'rb') as fp:
            for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                          password=password, caching=caching,
                                          check_extractable=True):
                interpreter.process_page(page)

        text = retstr.getvalue()

        device.close()
        retstr.close()
        return text

    pdf_text_page6 = convert_pdf_to_txt("15000691526464720.pdf", pages=[6])
    

    The output for the given page (index 6, which is page 7 of the document because page numbers are zero-based) looks like the block below. It is not perfect, but every number in the table now appears on the same line as its label.

    Page 7 of 11 
    
      Unaudited    Unaudited 
    
      1Q2018  1Q2017   1Q2018  4Q2017 
    
    Revenues  73,528  49,552   73,528  66,392 
    
        VAS   46,877  35,108   46,877  39,947 
    
       Online advertising   10,689  6,888   10,689  12,361 
    
        Others  15,962  7,556   15,962  14,084 
    
    Cost of revenues  (36,486)  (24,109)   (36,486)  (34,897) 
    
    Gross profit  37,042  25,443   37,042  31,495
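
Once each table row sits on a single line, as in the output above, the lines can be split into a label and its numbers. The following is only a rough sketch, not part of the original answer: `parse_row` and `NUMBER` are hypothetical names, and it assumes that values in parentheses are negative (the usual accounting convention) and that the numeric cells are always the trailing tokens of the line.

```python
import re

# One numeric cell such as 73,528 or (36,486).
# Assumption: parentheses mark a negative value.
NUMBER = re.compile(r"\(?\d[\d,]*(?:\.\d+)?\)?")

def parse_row(line):
    """Split 'Label  1,234  (567)' into ('Label', [1234.0, -567.0]).

    Returns None for blank lines and for lines without trailing numbers.
    """
    tokens = line.split()
    values = []
    # Walk from the right, collecting numeric cells until the label starts.
    while tokens and NUMBER.fullmatch(tokens[-1]):
        cell = tokens.pop()
        negative = cell.startswith("(") and cell.endswith(")")
        number = float(cell.strip("()").replace(",", ""))
        values.append(-number if negative else number)
    if not tokens or not values:
        return None
    values.reverse()  # restore left-to-right column order
    return " ".join(tokens), values

# A few lines copied from the extracted text above.
sample = """\
  1Q2018  1Q2017   1Q2018  4Q2017
Revenues  73,528  49,552   73,528  66,392
    VAS   46,877  35,108   46,877  39,947
Cost of revenues  (36,486)  (24,109)   (36,486)  (34,897)
"""

rows = [row for row in map(parse_row, sample.splitlines()) if row]
for label, values in rows:
    print(label, values)
```

The header line is skipped because tokens such as 1Q2017 are not purely numeric. A line like "Page 7 of 11" would still be picked up as a one-value row, so in practice it may help to keep only rows with the expected number of columns (four for this table).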