Search code examples
pythonpandastabula

Iterating through files and passing each one through a function


I'm trying to build a budget calculator in order to practice python. At the moment i'm trying to iterate through the files in a directory, and then pass each file through a function to extract the data I need to a DataFrame (ready for it to have calculations performed on it).

I've managed to create the function to clean the data, and a for loop to iterate over the files. However, I cannot figure out how to append the DataFrame for each iteration.

#Where to look
os.chdir(r"C:\relevant\directory")
cwd = os.getcwd()

#key variables
main_df = pd.DataFrame()
pay_slip = {}
master_df = pd.DataFrame()

#Iterate over files
for file in os.listdir():
    slip_content = read_pdf(file)
    pay_slip[file] = slip_content

#Data clean up function
def get_key_info(pay_slip):
    read_dictionary = pay_slip.get(file)
    salary_str = read_dictionary["Employee"].iloc[2]
    pay_after_tax_str = read_dictionary["Tax Period"].iloc[14]
    date_format = read_dictionary["Pay Date"].iloc[0]
    salary = int(float(salary_str[1:].replace(",", "")))
    pay = int(float(pay_after_tax_str[1:].replace(",", "")))
    deductions = (salary - pay)
    df = pd.DataFrame([
        [date_format, salary, pay, deductions]
        ],
        columns=["Payment date", "Salary before tax", "take home pay", "total deductions"])
    return df

print(get_key_info(pay_slip))

When I run this code, only one of the files is added to the DataFrame instead of all the files as it should.

Thanks in advance for any help


Solution

  • Thanks for the help Florian, i've fixed this looping over my directory like you said.

    However, I was unable to loop over a dictionary as it is not hashable.

    I'll post my code below in case anyone else is having the same issue as I was.

        #Where to look
    os.chdir(r"C:\relevant\directory")
    cwd = os.getcwd()
    
    #key variables
    master_df = pd.DataFrame()
    
    
    #Data clean up function
    def get_key_info(x):
        salary_str = get_data["Employee"].iloc[2]
        pay_after_tax_str = get_data["Tax Period"].iloc[14]
        date_format = get_data["Pay Date"].iloc[0]
        salary = int(float(salary_str[1:].replace(",", "")))
        pay = int(float(pay_after_tax_str[1:].replace(",", "")))
        deductions = (salary - pay)
        df = pd.DataFrame([
            [date_format, salary, pay, deductions]
            ],
            columns=["Payment date", "Salary before tax", "take home pay", "total deductions"])
        return df
    
    #Iterate over files
    for f in os.listdir():
        get_data = read_pdf(f)
        master_df = master_df.append(get_key_info(f), ignore_index = True)
    
    print(master_df)
    

    Here I have set the variable get_data to change every iteration of the for loop, which will then .append() the master_df