I have a total of 28 excel files in a folder. Each of them consists of 72 columns (71 variables and 1 target). using the code below, I try to predict the target value for 1 excel file, which the final outcome should be A, B, C, and D.
How can I use glob so that the code reads all excel files, do the prediction process, and finally puts the final results (A, B, C, and D) of each excel file in a row of a new data frame (row1=excel1, row2=excel2 and ...)
import numpy as np
import pandas as pd
import glob
for file in list(glob.glob('*.xlsx')):
data = pd.read_excel(file)
X = data.drop(['HWD'],1)
Y = data['HWD']
# Split DataSet to Train & Test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y,
test_size=0.2,
random_state=0)
# Prediction 1
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
ABR = AdaBoostRegressor()
ABR.fit(X_train, y_train)
ABRtest=ABR.predict(X_test)
# Calculate metrics 1
from sklearn.metrics import mean_squared_error
import math
from scipy.stats import variation
from sklearn.metrics import mean_poisson_deviance
df_list = []
a = pd.DataFrame({'Obs':(y_test),'Pre':(ABRtest)})
df_list.append(a)
column_1 = a["Obs"]
column_2 = a["Pre"]
A = column_1.corr(column_2)
MSE = mean_squared_error(column_1, column_2)
B = math.sqrt(MSE)
C = variation(column_2 , axis = 0)
D = mean_poisson_deviance(column_1, column_2)
e below is the final row for each excel, I need a code that constructs a data frame with 28 rows containing results for each file.
df_list = []
e = pd.DataFrame({'CC':[A],'RMSE':[B],'RSD':[C],'MPD':[D]})
df_list.append(e)
You could use df.append()
to add rows below the existing df
e = pd.DataFrame() #initiate an empty df
for i in range(28):
A,B,C,D = i,i,i,i #assume you have the results here
e = e.append(pd.DataFrame({'CC':[A],'RMSE':[B],'RSD':[C],'MPD':[D]}))
e.reset_index(drop=True, inplace=True)
Is this your desired output?
CC RMSE RSD MPD
0 0 0 0 0
1 1 1 1 1
2 2 2 2 2
...
25 25 25 25 25
26 26 26 26 26
27 27 27 27 27
To insert into your code:
import glob
e = pd.DataFrame({'CC':[],'RMSE':[],'RSD':[],'MPD':[]}, dtype='object') #<--insert here, initiate an empty df
for file in list(glob.glob('*.xlsx')):
data = pd.read_excel(file)
...
C = variation(column_2 , axis = 0)
D = mean_poisson_deviance(column_1, column_2)
e = e.append(pd.DataFrame({'CC':[str(A)],'RMSE':[str(B)],'RSD':[str(C)],'MPD':[str(D)]})) #<--insert here
e.reset_index(drop=True, inplace=True) #<--insert this line outside for-loop