So this is kind of weird but I'm new to Python and I'm committed to seeing my first project with Python through to the end.
So I am reading about 100 .xlsx files in from a file path. I then trim each file and send only the important information to a list, as an individual and unique dataframe. So now I have a list of 100 unique dataframes, but iterating through the list and writing to excel just overwrites the data in the file. I want to append the end of the .xlsx file. The biggest catch to all of this is, I can only use Excel 2010, I do not have any other version of the application. So the openpyxl library seems to have some interesting stuff, I've tried something like this:
from openpyxl.utils.dataframe import dataframe_to_rows
wb = load_workbook(outfile_path)
ws = wb.active
for frame in main_df_list:
for r in dataframe_to_rows(frame, index = True, header = True):
ws.append(r)
Note: In another post I was told it's not best practice to read dataframes line by line using loops, but when I started I didn't know that. I am however committed to this monstrosity.
Edit after reading Comments
So my code scrapes .xlsx files and stores specific data based on a keyword comparison into dataframes. These dataframes are stored in a list, I will list the entirety of the program below so hopefully I can explain what's in my head. Also, feel free to roast my code because I have no idea what is actual good python practices vs. not.
import os
import pandas as pd
from openpyxl import load_workbook
#the file path I want to pull from
in_path = r'W:\R1_Manufacturing\Parts List Project\Tool_scraping\Excel'
#the file path where row search items are stored
search_parameters = r'W:\R1_Manufacturing\Parts List Project\search_params.xlsx'
#the file I will write the dataframes to
outfile_path = r'W:\R1_Manufacturing\Parts List Project\xlsx_reader.xlsx'
#establishing my list that I will store looped data into
file_list = []
main_df = []
master_list = []
#open the file path to store the directory in files
files = os.listdir(in_path)
#database with terms that I want to track
search = pd.read_excel(search_parameters)
search_size = search.index
#searching only for files that end with .xlsx
for file in files:
if file.endswith('.xlsx'):
file_list.append(in_path + '/' + file)
#read in the files to a dataframe, main loop the files will be maninpulated in
for current_file in file_list:
df = pd.read_excel(current_file)
#get columns headers and a range for total rows
columns = df.columns
total_rows = df.index
#adding to store where headers are stored in DF
row_list = []
column_list = []
header_list = []
for name in columns:
for number in total_rows:
cell = df.at[number, name]
if isinstance(cell, str) == False:
continue
elif cell == '':
continue
for place in search_size:
search_loop = search.at[place, 'Parameters']
#main compare, if str and matches search params, then do...
if insensitive_compare(search_loop, cell) == True:
if cell not in header_list:
header_list.append(df.at[number, name]) #store data headers
row_list.append(number) #store row number where it is in that data frame
column_list.append(name) #store column number where it is in that data frame
else:
continue
else:
continue
for thing in column_list:
df = pd.concat([df, pd.DataFrame(0, columns=[thing], index = range(2))], ignore_index = True)
#turns the dataframe into a set of booleans where its true if
#theres something there
na_finder = df.notna()
#create a new dataframe to write the output to
outdf = pd.DataFrame(columns = header_list)
for i in range(len(row_list)):
k = 0
while na_finder.at[row_list[i] + k, column_list[i]] == True:
#I turn the dataframe into booleans and read until False
if(df.at[row_list[i] + k, column_list[i]] not in header_list):
#Store actual dataframe into my output dataframe, outdf
outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
k += 1
main_df.append(outdf)
So main_df is a list that has 100+ dataframes in it. For this example I will only use 2 of them. I would like them to print out into excel like:
So the comment from Ashish really helped me, all of the dataframes had different column titles so my 100+ dataframes eventually concat'd to a dataframe that is 569X52. Here is the code that I used, I completely abandoned openpyxl because once I was able to concat all of the dataframes together, I just had to export it using pandas:
# what I want to do here is grab all the data in the same column as each
# header, then move to the next column
for i in range(len(row_list)):
k = 0
while na_finder.at[row_list[i] + k, column_list[i]] == True:
if(df.at[row_list[i] + k, column_list[i]] not in header_list):
outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
k += 1
main_df.append(outdf)
to_xlsx_df = pd.DataFrame()
for frame in main_df:
to_xlsx_df = pd.concat([to_xlsx_df, frame])
to_xlsx_df.to_excel(outfile_path)
The output to excel ended up looking something like this:
Hopefully this can help someone else out too.