Search code examples
pythonexcelpandasdataframeglob

Import several excel files using glob and iteratively work on them and extract a data frame from results


I have a total of 28 excel files in a folder. Each of them consists of 72 columns (71 variables and 1 target). using the code below, I try to predict the target value for 1 excel file, which the final outcome should be A, B, C, and D.

How can I use glob so that the code reads all excel files, do the prediction process, and finally puts the final results (A, B, C, and D) of each excel file in a row of a new data frame (row1=excel1, row2=excel2 and ...)

import numpy as np
import pandas as pd                                                                   
import glob                                                                 
for file in list(glob.glob('*.xlsx')): 

   data = pd.read_excel(file)
   X = data.drop(['HWD'],1)
   Y = data['HWD']

   # Split DataSet to Train & Test
   from sklearn.model_selection import train_test_split
   X_train, X_test, y_train, y_test = train_test_split(X, Y,
                                                       test_size=0.2,
                                                       random_state=0)
   # Prediction 1
   from sklearn.tree import DecisionTreeRegressor
   from sklearn.ensemble import AdaBoostRegressor
   ABR = AdaBoostRegressor()
   ABR.fit(X_train, y_train)
   ABRtest=ABR.predict(X_test)

   # Calculate metrics 1
   from sklearn.metrics import mean_squared_error
   import math
   from scipy.stats import variation 
   from sklearn.metrics import mean_poisson_deviance

   df_list = []
   a = pd.DataFrame({'Obs':(y_test),'Pre':(ABRtest)})
   df_list.append(a)
   column_1 = a["Obs"]
   column_2 = a["Pre"]
   A = column_1.corr(column_2)
   MSE = mean_squared_error(column_1, column_2)
   B = math.sqrt(MSE)
   C = variation(column_2 , axis = 0)
   D = mean_poisson_deviance(column_1, column_2)

e below is the final row for each excel, I need a code that constructs a data frame with 28 rows containing results for each file.

df_list = []
e = pd.DataFrame({'CC':[A],'RMSE':[B],'RSD':[C],'MPD':[D]})
df_list.append(e)

Solution

  • You could use df.append() to add rows below the existing df

    e = pd.DataFrame()   #initiate an empty df
    for i in range(28):
        A,B,C,D = i,i,i,i   #assume you have the results here
        e = e.append(pd.DataFrame({'CC':[A],'RMSE':[B],'RSD':[C],'MPD':[D]}))
    e.reset_index(drop=True, inplace=True)
    

    Is this your desired output?

        CC RMSE RSD MPD
    0   0   0   0   0
    1   1   1   1   1
    2   2   2   2   2
    ...
    25  25  25  25  25
    26  26  26  26  26
    27  27  27  27  27
    

    To insert into your code:

    import glob
    e = pd.DataFrame({'CC':[],'RMSE':[],'RSD':[],'MPD':[]}, dtype='object')   #<--insert here, initiate an empty df                                                                
    for file in list(glob.glob('*.xlsx')): 
        data = pd.read_excel(file)
        ...
        C = variation(column_2 , axis = 0)
        D = mean_poisson_deviance(column_1, column_2)
        e = e.append(pd.DataFrame({'CC':[str(A)],'RMSE':[str(B)],'RSD':[str(C)],'MPD':[str(D)]}))   #<--insert here
    e.reset_index(drop=True, inplace=True)   #<--insert this line outside for-loop