Tags: python, pandas, excel-2010, openpyxl

Is there a way to export a list of 100+ dataframes to excel?


So this is kind of weird, but I'm new to Python and I'm committed to seeing my first Python project through to the end.

So I am reading about 100 .xlsx files in from a file path. I then trim each file and append only the important information, as an individual dataframe, to a list. So now I have a list of 100 unique dataframes, but iterating through the list and writing to Excel just overwrites the data in the file; I want to append to the end of the .xlsx file. The biggest catch to all of this is that I can only use Excel 2010, I do not have any other version of the application. The openpyxl library seems to have some interesting stuff, so I've tried something like this:

from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows

wb = load_workbook(outfile_path)
ws = wb.active

for frame in main_df_list:
    for r in dataframe_to_rows(frame, index=True, header=True):
        ws.append(r)

wb.save(outfile_path)  #without an explicit save, nothing is written back to disk

Note: In another post I was told it's not best practice to read dataframes line by line using loops, but when I started I didn't know that. I am however committed to this monstrosity.
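For context on the snippet above: `ws.append` always writes to the next empty row, so appending a list of frames one under another does work once the workbook is saved. A minimal sketch of that idea, using a fresh workbook and made-up frames standing in for the real list (names and the output path are illustrative only):

```python
import pandas as pd
from openpyxl import Workbook
from openpyxl.utils.dataframe import dataframe_to_rows

# made-up frames standing in for the real list of 100+ dataframes
main_df_list = [pd.DataFrame({"part": ["A", "B"], "qty": [1, 2]}),
                pd.DataFrame({"part": ["C"], "qty": [3]})]

wb = Workbook()
ws = wb.active

for frame in main_df_list:
    for r in dataframe_to_rows(frame, index=False, header=True):
        ws.append(r)  # append() always writes to the next empty row
    ws.append([])     # blank spacer row between frames

wb.save("combined.xlsx")  # nothing hits disk without an explicit save
```

With `index=False` each frame contributes just its header row plus its data rows, so the second frame's header lands directly below the first frame's spacer row.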

Edit after reading Comments

So my code scrapes .xlsx files and stores specific data, based on a keyword comparison, into dataframes. These dataframes are stored in a list. I will post the entire program below so hopefully I can explain what's in my head. Also, feel free to roast my code, because I have no idea what counts as good Python practice.

import os
import pandas as pd
from openpyxl import load_workbook

#the file path I want to pull from
in_path = r'W:\R1_Manufacturing\Parts List Project\Tool_scraping\Excel'
#the file path where row search items are stored
search_parameters = r'W:\R1_Manufacturing\Parts List Project\search_params.xlsx'
#the file I will write the dataframes to
outfile_path = r'W:\R1_Manufacturing\Parts List Project\xlsx_reader.xlsx'

#establishing my list that I will store looped data into
file_list = []
main_df = []
master_list = []

#open the file path to store the directory in files
files = os.listdir(in_path)

#database with terms that I want to track
search = pd.read_excel(search_parameters)
search_size = search.index

#searching only for files that end with .xlsx
for file in files:
    if file.endswith('.xlsx'):
        file_list.append(in_path + '/' + file)

#read the files into a dataframe; this is the main loop the files are manipulated in
for current_file in file_list:
    df = pd.read_excel(current_file)
    
    #get columns headers and a range for total rows
    columns = df.columns
    total_rows = df.index
    
    #adding to store where headers are stored in DF
    row_list = []
    column_list = []
    header_list = []

    #scan every cell for string matches against the search parameters
    for name in columns:
        for number in total_rows:
            cell = df.at[number, name]
            if not isinstance(cell, str) or cell == '':
                continue
            for place in search_size:
                search_loop = search.at[place, 'Parameters']
                #main compare: if the string matches the search params, then store it
                if insensitive_compare(search_loop, cell) and cell not in header_list:
                    header_list.append(cell)   #store data headers
                    row_list.append(number)    #store the row where it is in that dataframe
                    column_list.append(name)   #store the column where it is in that dataframe
    
    for thing in column_list:
        df = pd.concat([df, pd.DataFrame(0, columns=[thing], index=range(2))], ignore_index=True)

    #turns the dataframe into a set of booleans where its true if 
    #theres something there
    na_finder = df.notna()
    
    #create a new dataframe to write the output to
    outdf = pd.DataFrame(columns = header_list)


    #grab all the data in the same column as each header, then move to the next column
    for i in range(len(row_list)):
        k = 0
        #na_finder holds booleans: walk down the column until the first empty cell
        while na_finder.at[row_list[i] + k, column_list[i]]:
            if df.at[row_list[i] + k, column_list[i]] not in header_list:
                #store the actual data into my output dataframe, outdf
                outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
            k += 1
            
    main_df.append(outdf)
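
One thing the listing never defines is `insensitive_compare`. Going by the name, it's presumably a case-insensitive string match; a guess at a minimal version (an assumption, not the author's actual helper) might be:

```python
def insensitive_compare(search_term, cell_value):
    #assumed behavior: match ignoring case and surrounding whitespace
    return str(search_term).strip().lower() == str(cell_value).strip().lower()
```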

So main_df is a list that has 100+ dataframes in it. For this example I will only use 2 of them. I would like them to print out into Excel like this: [image: output to a separate Excel file]
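
If the goal is each dataframe stacked below the previous one in a single sheet, pandas can do the bookkeeping without touching openpyxl directly, since `to_excel` accepts a `startrow`. A sketch with two made-up frames (the data and the file name are illustrative, not from the original code):

```python
import pandas as pd

# made-up stand-ins for two of the 100+ scraped dataframes
main_df = [pd.DataFrame({"Tool": ["T1", "T2"], "Qty": [4, 7]}),
           pd.DataFrame({"Fixture": ["F1"], "Count": [2]})]

with pd.ExcelWriter("stacked.xlsx") as writer:
    row = 0
    for frame in main_df:
        frame.to_excel(writer, sheet_name="Sheet1", startrow=row, index=False)
        row += len(frame) + 2  # header row + data rows + one blank spacer
```

Repeated `to_excel` calls inside one `ExcelWriter` context reuse the same sheet, so each frame lands where the running `row` counter points.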


Solution

  • So the comment from Ashish really helped me: all of the dataframes had different column titles, so my 100+ dataframes eventually concatenated into a single 569x52 dataframe. Here is the code that I used. I completely abandoned openpyxl, because once I was able to concat all of the dataframes together I just had to export it using pandas:

    # what I want to do here is grab all the data in the same column as each 
    # header, then move to the next column
    for i in range(len(row_list)):
        k = 0
        while na_finder.at[row_list[i] + k, column_list[i]] == True:
            if(df.at[row_list[i] + k, column_list[i]] not in header_list):
                outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
            k += 1
                
    main_df.append(outdf)
    
    to_xlsx_df = pd.DataFrame()
    
    for frame in main_df:
        to_xlsx_df = pd.concat([to_xlsx_df, frame])           
    
    to_xlsx_df.to_excel(outfile_path)
    

    The output to Excel ended up looking something like this: [image: Excel output]

    Hopefully this can help someone else out too.
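
If anyone needs the frames kept separate instead of stacked into one 569x52 table, one sheet per dataframe is a small variation on the same export; a sketch (the data and sheet names are invented for illustration):

```python
import pandas as pd

# made-up stand-ins for the list of scraped dataframes
main_df = [pd.DataFrame({"Tool": ["T1"], "Qty": [4]}),
           pd.DataFrame({"Fixture": ["F1"], "Count": [2]})]

with pd.ExcelWriter("per_sheet.xlsx") as writer:
    for i, frame in enumerate(main_df):
        frame.to_excel(writer, sheet_name=f"frame_{i}", index=False)
```

This sidesteps the mismatched-columns problem entirely, at the cost of 100+ sheets in the output workbook.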