I'm trying to build a budget calculator in order to practice python. At the moment i'm trying to iterate through the files in a directory, and then pass each file through a function to extract the data I need to a DataFrame (ready for it to have calculations performed on it).
I've managed to create the function to clean the data, and a for loop to iterate over the files. However, I cannot figure out how to append the DataFrame for each iteration.
#Where to look
os.chdir(r"C:\relevant\directory")
cwd = os.getcwd()
#key variables
main_df = pd.DataFrame()
pay_slip = {}
master_df = pd.DataFrame()
#Iterate over files
for file in os.listdir():
slip_content = read_pdf(file)
pay_slip[file] = slip_content
#Data clean up function
def get_key_info(pay_slip):
read_dictionary = pay_slip.get(file)
salary_str = read_dictionary["Employee"].iloc[2]
pay_after_tax_str = read_dictionary["Tax Period"].iloc[14]
date_format = read_dictionary["Pay Date"].iloc[0]
salary = int(float(salary_str[1:].replace(",", "")))
pay = int(float(pay_after_tax_str[1:].replace(",", "")))
deductions = (salary - pay)
df = pd.DataFrame([
[date_format, salary, pay, deductions]
],
columns=["Payment date", "Salary before tax", "take home pay", "total deductions"])
return df
print(get_key_info(pay_slip))
When I run this code, only one of the files is added to the DataFrame instead of all the files as it should.
Thanks in advance for any help
Thanks for the help Florian, i've fixed this looping over my directory like you said.
However, I was unable to loop over a dictionary as it is not hashable.
I'll post my code below in case anyone else is having the same issue as I was.
#Where to look
os.chdir(r"C:\relevant\directory")
cwd = os.getcwd()
#key variables
master_df = pd.DataFrame()
#Data clean up function
def get_key_info(x):
salary_str = get_data["Employee"].iloc[2]
pay_after_tax_str = get_data["Tax Period"].iloc[14]
date_format = get_data["Pay Date"].iloc[0]
salary = int(float(salary_str[1:].replace(",", "")))
pay = int(float(pay_after_tax_str[1:].replace(",", "")))
deductions = (salary - pay)
df = pd.DataFrame([
[date_format, salary, pay, deductions]
],
columns=["Payment date", "Salary before tax", "take home pay", "total deductions"])
return df
#Iterate over files
for f in os.listdir():
get_data = read_pdf(f)
master_df = master_df.append(get_key_info(f), ignore_index = True)
print(master_df)
Here I have set the variable get_data
to change every iteration of the for loop, which will then .append()
the master_df