Search code examples
pythonappendconcatenation

How to improve my append and read excel For loop in python


Hope you can help me.

I have a folder where there are several .xlsx files with similar structure (NOTE that some of the files might be bigger than 50MB). I want to combine them all together and (eventually) send them to a database. But before that, I need to improve the performance of this block of code because sometimes it takes a lot of time to process all those files.

The code in question is this:

df_list = []    
for file in location:
    df_list.append(pd.read_excel(file, header=0, engine='openpyxl'))

df_concat = pd.concat(df_list)

Any suggestions?

Somewhere I read that converting Excel files to CSV might improve the performance, but should I do that before appending the files or after everything is concatenated? And considering df_list is a list, can I do that conversion?


Solution

  • I've found a solution with xlsx2csv

    xlsx_path = './data/Extract/'
    csv_path = './data/csv/'
    list_of_xlsx = glob.glob(xlsx_path+'*.xlsx')
    
    
    for xlsx in list_of_xlsx:
        # Extract File Name on group 2 "(.+)"
        filename = re.search(r'(.+[\\|\/])(.+)(\.(xlsx))', xlsx).group(2)
        # Setup the call for subprocess.call()
        call = ["python", "./xlsx2csv.py", xlsx, csv_path+filename+'.csv']
        try:
            subprocess.call(call) # On Windows use shell=True
        except:
            print('Failed with {}'.format(filepath)
    
    outputcsv = './data/bigcsv.csv' #specify filepath+filename of output csv
    
    listofdataframes = []
    for file in glob.glob(csv_path+'*.csv'):
        df = pd.read_csv(file)
        if df.shape[1] == 24: # make sure 24 columns
            listofdataframes.append(df)
        else:
            print('{}  has {} columns - skipping'.format(file,df.shape[1]))
    
    bigdataframe = pd.concat(listofdataframes).reset_index(drop=True)
    bigdataframe.to_csv(outputcsv,index=False)
    

    I tried to make this work for me but had no success. Maybe you might be able to have it working for you? Or does anyone have any ideas?