python for-loop scikit-learn random-forest prediction

Save each model's predictions inside a for loop under different names

I'm using scikit-learn to run a random forest model on three different DataFrames (aa, bb, cc). I iterate over them with a for loop and generate a confusion matrix for each model. The problem is that I would like to save the predictions of each model so I can use them later for ROC analysis.

This is the original loop:


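import pandas as pd
import seaborn
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

todrop=['name','code','date','nitrogen','Hour','growth_day']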
col='test'

dfs=[aa,bb,cc]


for h in dfs:
    h.drop(todrop,axis=1,inplace=True)
    y=h[col]
    X=h.drop(col,axis=1)
    X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,stratify=y,random_state=42)
    
    rfc = RandomForestClassifier()
    rfc.fit(X_train,y_train)
    predictions = rfc.predict(X_test)
    conf_mat = confusion_matrix(y_test, predictions)
    df_conf_norm = conf_mat / conf_mat.sum(axis=0)

    index = ['healthy','sick'] 
    columns = ['healthy','sick'] 
    cm_df = pd.DataFrame(df_conf_norm,columns,index)
    

    seaborn.heatmap(cm_df,annot=True, annot_kws={"size": 15},linewidths=.5, cmap='YlOrRd')

    plt.title('Random Forest', fontsize = 20) # title with fontsize 20
    # confusion_matrix puts true labels on rows (y-axis) and predictions on columns (x-axis)
    plt.xlabel('Prediction', fontsize = 17) # x-axis label with fontsize 17
    plt.ylabel('True Labels', fontsize = 17) # y-axis label with fontsize 17
    plt.show()

    print('Accuracy: {}'.format(accuracy_score(y_test,predictions)))

I tried to save the predictions like this:

  predictions[h] = rfc.predict(X_test)

but then I get the error:

IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

I also tried using zip so that I could store the predictions under each DataFrame's name:

names=['aa','bb','cc']

for h,n in (zip(dfs,names)):
...
    predictions[n] = rfc.predict(X_test)

but I got the same error.

My end goal is to save the predictions of each model so that I can plot a ROC graph at the end.

Solution

In each iteration of the loop you create a NumPy array called predictions:

predictions = rfc.predict(X_test)
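The failure can be reproduced without any model. This is a minimal sketch with made-up values standing in for the output of rfc.predict; only the string key 'aa' comes from the question.

```python
import numpy as np

# Made-up stand-in for the array returned by rfc.predict(X_test)
predictions = np.array([0, 1, 1, 0])

try:
    # A NumPy array cannot be indexed with a string key
    predictions['aa'] = np.array([1, 0, 1, 0])
except IndexError as exc:
    message = str(exc)
    print(message)
```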

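Writing predictions[h] or predictions[n] then tries to index that array with a DataFrame or a string, which is what raises the IndexError. If you want to save each model's predictions in a container such as a dict, you need to declare it under a different name outside the loop: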

all_predictions = dict()

Then modifying the code from your second example should work:

names=['aa','bb','cc']

for h,n in (zip(dfs,names)):
...
    all_predictions[n] = rfc.predict(X_test)
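Put together, the loop might look like the sketch below. It is not your exact pipeline: make_frame is a hypothetical helper that builds synthetic data in place of aa, bb and cc, and the all_scores, all_y_test and roc_auc dicts are names added here for illustration. Since the end goal is a ROC graph, the sketch also stores predict_proba scores and the matching y_test, because roc_curve works on continuous scores rather than hard class labels.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

col = 'test'

def make_frame(seed):
    # Hypothetical stand-in for the real DataFrames aa, bb, cc
    X, y = make_classification(n_samples=200, n_features=5, random_state=seed)
    frame = pd.DataFrame(X, columns=[f'f{i}' for i in range(5)])
    frame[col] = y
    return frame

dfs = {'aa': make_frame(0), 'bb': make_frame(1), 'cc': make_frame(2)}

# Containers declared once, outside the loop, keyed by DataFrame name
all_predictions = {}
all_scores = {}
all_y_test = {}

for name, h in dfs.items():
    y = h[col]
    X = h.drop(col, axis=1)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    rfc = RandomForestClassifier(random_state=42)
    rfc.fit(X_train, y_train)

    all_predictions[name] = rfc.predict(X_test)         # hard class labels
    all_scores[name] = rfc.predict_proba(X_test)[:, 1]  # probability of class 1
    all_y_test[name] = y_test                           # true labels for the same rows

# Later: one ROC curve (and its area) per model
roc_auc = {}
for name in dfs:
    fpr, tpr, _ = roc_curve(all_y_test[name], all_scores[name])
    roc_auc[name] = auc(fpr, tpr)
```

The fpr and tpr arrays from each iteration can be passed to plt.plot to draw all three curves on one figure.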