Tags: python, pypdf, pdfplumber, recursionerror

PDF: parsing a sentence across multiple lines


Goal: if a PDF line contains a substring, copy the entire sentence it belongs to (which may span multiple lines).

I am able to print() the line the phrase appears in.

Once I find this line, I want to iterate backward until I reach a sentence terminator (. ! ?) that ends the previous sentence, and then iterate forward until the next sentence terminator.

This is so that I can print() the entire sentence the phrase belongs to.
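
As a rough sketch of that idea (this is not the notebook code below; `sentence_around`, `TERMINATORS`, and the sample lines are made up for illustration), the lines can be joined into one string and scanned outward from the phrase, which avoids recursion altogether:

```python
TERMINATORS = ".!?"

def sentence_around(lines, phrase):
    """Return the sentence containing `phrase`, or None if the phrase is absent."""
    # Join with spaces so that words at line breaks do not fuse together.
    text = " ".join(line.strip() for line in lines)
    hit = text.find(phrase)
    if hit == -1:
        return None
    # Backward scan: the last terminator before the phrase ends the previous
    # sentence. rfind() returns -1 when nothing is found, so start becomes 0.
    start = max(text.rfind(t, 0, hit) for t in TERMINATORS) + 1
    # Forward scan: the first terminator after the phrase ends this sentence.
    candidates = (text.find(t, hit + len(phrase)) for t in TERMINATORS)
    ends = [pos for pos in candidates if pos != -1]
    end = min(ends) + 1 if ends else len(text)
    return text[start:end].strip()

# Illustrative lines only, loosely modelled on the text quoted in this post.
lines = [
    "connection and the linkage to the relevant UN's 17 SDGs.",
    "GPIC is a Responsible Care Company certified for RC 14001",
    "since July 2010. We have long realized and recognized that",
]
print(sentence_around(lines, "Responsible Care Company"))
```

This treats every `.`, `!`, or `?` as a sentence boundary, so abbreviations and decimal numbers would cut a sentence short, and a phrase that is itself split across two lines is only found because the lines are joined first.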


Jupyter Notebook:

# pip install PyPDF2
# pip install pdfplumber

# ---
# import re
import glob
import PyPDF2
import pdfplumber

# ---
phrase = "Responsible Care Company"
# SENTENCE_REGEX = re.pattern('^[A-Z][^?!.]*[?.!]$')

def scrape_sentence(sentence, lines, index):
    if '.' in lines[index] or '!' in lines[index] or '?' in lines[index]:
        return sentence.replace('\n', '').strip()
    sentence = scrape_sentence(lines[index-1] + sentence, lines, index-1)  # previous line
    sentence = scrape_sentence(sentence + lines[index+1], lines, index+1)  # following line
    return sentence
    
# ---    
    
with pdfplumber.open('../data/gri/reports/GPIC_Sustainability_Report_2020__-_40_Years_of_Sustainable_Success.pdf') as opened_pdf:
    for page in opened_pdf.pages:
        text = page.extract_text()
        lines = text.split('\n')
        i = 0
        sentence = ''
        while i < len(lines):
            if 'and Knowledge of Individuals; Behaviours; Attitudes, Perception ' in lines[i]:
                sentence = scrape_sentence('', lines, i)  # !
                print(sentence)  # !
            i += 1

Output:

connection and the linkage to the relevant UN’s 17 SDGs.and Leadership. We have long realized and recognized that there

Phrase:

Responsible Care Company

Sentence (across multiple lines):

"GPIC is a Responsible Care Company certified for RC 14001 
since July 2010."

PDF (pg. 2).

I have been working on "back-tracking" iterations, based on this solution. I did try a for loop, but it doesn't let you step back to earlier iterations.

Regex sentence added


Please let me know if there is anything else I can add to the post.


Solution

  • I have a working version. However, this does not account for multiple columns of text from a .pdf page.

    See here for a discussion related to that.


    Example .pdf

    Jupyter Notebook:

    # pip install PyPDF2
    # pip install pdfplumber
    
    # ---
    
    import glob
    import PyPDF2
    import pdfplumber
    
    # ---
    
    def scrape_sentence(phrase, lines, index):
        # -- Gather sentence 'phrase' occurs in --
        sentence = lines[index]
        print("-- sentence --", sentence)
        print("len(lines)", len(lines))
        
        # Previous lines
        pre_i, flag = index, 0
        while flag == 0:
            pre_i -= 1
            if pre_i < 0:  # ran past the first line of the page
                break
                
            sentence = lines[pre_i] + sentence
            
            if '.' in lines[pre_i] or '!' in lines[pre_i] or '?' in lines[pre_i] or '  •  ' in lines[pre_i]:
                flag = 1  # terminator found: stop scanning backward
        
        print("\n", sentence)
        
        # Following lines
        post_i, flag = index, 0
        while flag == 0:
            post_i += 1
            if post_i >= len(lines):  # ran past the last line of the page
                break
                
            sentence = sentence + lines[post_i] 
            
            if '.' in lines[post_i] or '!' in lines[post_i] or '?' in lines[post_i] or '  •  ' in lines[post_i]:
                flag = 1  # terminator found: stop scanning forward
        
        print("\n", sentence)
        
        # -- Extract --
        sentence = sentence.replace('!', '.')
        sentence = sentence.replace('?', '.')
        sentence = sentence.split('.')
        sentence = [s for s in sentence if phrase in s]
        print(sentence)
        sentence = sentence[0].replace('\n', '').strip()  # first occurrence
        print(sentence)
        
        return sentence
    
    # ---
    
    phrase = 'Global Reporting Initiative'
    
    with pdfplumber.open('GPIC_Sustainability_Report_2016-v9_(lr).pdf') as opened_pdf:
        for page in opened_pdf.pages:
            text = page.extract_text()
            if text is None:  # page has no extractable text
                continue
            lines = text.split('\n')
            i = 0
            sentence = ''
            while i < len(lines):
                if phrase in lines[i]:
                    sentence = scrape_sentence(phrase, lines, i)
                i += 1
    

    Output:

    -- sentence -- 2016 Global Reporting Initiative (GRI) Report
    len(lines) 7
    
     2016 Global Reporting Initiative (GRI) Report
    
     2016 Global Reporting Initiative (GRI) ReportIncluding: UN Global Compact - Communication on ProgressUN Global Compact - Food and Agriculture Business PrinciplesUN Global Compact - Women’s Empowerment Principlesgulf petrochemical industries companyii GPIC SuStaInabIlIty RePoRt 2016 GPIC SuStaInabIlIty RePoRt 2016 01
    ['2016 Global Reporting Initiative (GRI) ReportIncluding: UN Global Compact - Communication on ProgressUN Global Compact - Food and Agriculture Business PrinciplesUN Global Compact - Women’s Empowerment Principlesgulf petrochemical industries companyii GPIC SuStaInabIlIty RePoRt 2016 GPIC SuStaInabIlIty RePoRt 2016 01']
    2016 Global Reporting Initiative (GRI) ReportIncluding: UN Global Compact - Communication on ProgressUN Global Compact - Food and Agriculture Business PrinciplesUN Global Compact - Women’s Empowerment Principlesgulf petrochemical industries companyii GPIC SuStaInabIlIty RePoRt 2016 GPIC SuStaInabIlIty RePoRt 2016 01
    
    ...
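
For the multi-column limitation mentioned at the top of this answer, one possible workaround is to crop each page into vertical strips and extract the text of each strip separately. The following is a minimal sketch, assuming the columns are evenly spaced across the page; `column_texts` and `n_columns` are hypothetical names, and the function relies only on the `bbox`, `crop()`, and `extract_text()` members of a pdfplumber page:

```python
def column_texts(page, n_columns=2):
    """Return the text of each vertical strip of `page`, from left to right.

    `page` is expected to behave like a pdfplumber page. Evenly spaced
    columns are an assumption; real layouts may need hand-picked boundaries.
    """
    x0, top, x1, bottom = page.bbox
    step = (x1 - x0) / n_columns
    texts = []
    for k in range(n_columns):
        left = x0 + k * step
        # Use the exact page edge for the last strip to avoid rounding past it.
        right = x1 if k == n_columns - 1 else x0 + (k + 1) * step
        strip = page.crop((left, top, right, bottom))
        # extract_text() can return None for an empty region.
        texts.append(strip.extract_text() or "")
    return texts
```

Each returned string can then be split on '\n' and passed to scrape_sentence() on its own, so that a sentence is only assembled from lines within the same column.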