
Python: Parse through extracted lines


I am trying to scrape a PDF using pdfplumber: I want to extract the text from this PDF and convert it into an Excel file, with each value (the expense name, and each of the numbers) in its own cell.

I started on the code and was able to break the text down by line, so each line prints on its own. Now I am having trouble splitting the numbers into their own "columns". For example: Retail Rent 444,335.40 75.12 444,335.40 75.12

should become Retail Rent | 444,335.40 | 75.12 | 444,335.40 | 75.12

import pdfplumber

def extracted_lines_from_pdf(pdf_path): 
    extracted_lines = []

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
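            # Guard against pages with no extractable text
            # (some pdfplumber versions return None for them)
            lines = (page.extract_text() or '').splitlines()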
            extracted_lines.extend(lines)

    return extracted_lines

pdf_path = 'Sample_Numbers.pdf'
lines = extracted_lines_from_pdf(pdf_path)

for line in lines:
    print(line +'\n')

I have the lines in a list right now and, I guess, essentially want to end up with a list as well. I was hoping to do it in a way where it would recognize: "if there is a letter, this is the expense, and if a number follows, the first number goes with "period to date", the second with "%", and so on; if there is no number, go to the next line." (The output so far is below.)

Every line has the same structure: a string followed by numbers, all separated by spaces. The end goal is to convert this to Excel and have each entity in its own cell.



Solution

  • If you want to convert the list to a dataframe and export it as an xlsx file, this is one approach. Note that it writes each extracted line into a single cell, one line per row.

    import pdfplumber
    import pandas as pd
    
    def extracted_lines_from_pdf(pdf_path): 
        extracted_lines = []
    
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
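                # Guard against pages with no extractable text
                # (some pdfplumber versions return None for them)
                lines = (page.extract_text() or '').splitlines()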
                extracted_lines.extend(lines)
    
        return extracted_lines
    
    pdf_path = 'Sample_Numbers.pdf'
    lines = extracted_lines_from_pdf(pdf_path)
    
    data = [[item] for item in lines]
    
    df = pd.DataFrame(data)
    
    df.to_excel('output.xlsx', index=False, header=False)
    
    

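To get each number into its own cell, as the question asks, each line can be split into a label followed by its trailing numbers before the dataframe is built. The following is only a minimal sketch: `split_line` and the `NUMBER` pattern are hypothetical names, and the pattern assumes the numbers look like `444,335.40` or `75.12`, optionally negative, in parentheses, or followed by `%`. Adjust it to whatever the PDF actually contains.

```python
import re

# Assumed number format: digits with optional thousands separators and decimals,
# optionally negative, wrapped in parentheses, or followed by a percent sign.
NUMBER = re.compile(r'^\(?-?\d[\d,]*(?:\.\d+)?\)?%?$')

def split_line(line):
    """Return [label, number, number, ...] for one extracted line."""
    tokens = line.split()
    # Walk backwards from the end of the line while the tokens look numeric,
    # so a label that itself contains spaces stays in one piece.
    idx = len(tokens)
    while idx > 0 and NUMBER.match(tokens[idx - 1]):
        idx -= 1
    label = ' '.join(tokens[:idx])
    return [label] + tokens[idx:]

sample = 'Retail Rent 444,335.40 75.12 444,335.40 75.12'
print(split_line(sample))
# ['Retail Rent', '444,335.40', '75.12', '444,335.40', '75.12']
```

With this in place, `data = [split_line(line) for line in lines]` could replace the one-cell-per-line `data` above; `pd.DataFrame` pads shorter rows with empty cells. To skip lines that carry no numbers, keep only the rows where `len(row) > 1`. The values remain strings, so Excel will treat them as text unless they are converted to numbers first.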