
Save each model's predictions inside a for loop under different names


I'm using scikit-learn to run a random forest model on 3 different dataframes (aa, bb, cc). To do that, I use a for loop and generate a confusion matrix for each model. The problem is that I would like to save the predictions of each model so I can use them later for ROC analysis.

This is the original script of the loop:


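# imports needed by the loop below
import pandas as pd
import seaborn
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

todrop=['name','code','date','nitrogen','Hour','growth_day']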
col='test'

dfs=[aa,bb,cc]


for h in dfs:
    h.drop(todrop,axis=1,inplace=True)
    y=h[col]
    X=h.drop(col,axis=1)
    X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,stratify=y,random_state=42)
    
    rfc = RandomForestClassifier()
    rfc.fit(X_train,y_train)
    predictions = rfc.predict(X_test)
    conf_mat = confusion_matrix(y_test, predictions)
    df_conf_norm = conf_mat / conf_mat.sum(axis=0)

    index = ['healthy','sick'] 
    columns = ['healthy','sick'] 
    cm_df = pd.DataFrame(df_conf_norm,index=index,columns=columns)
    

    seaborn.heatmap(cm_df,annot=True, annot_kws={"size": 15},linewidths=.5, cmap='YlOrRd')

    plt.title('Random Forest', fontsize = 20) # title with fontsize 20
    # confusion_matrix puts true labels on rows (heatmap y-axis) and predictions on columns (x-axis)
    plt.xlabel('Prediction', fontsize = 17) # x-axis label with fontsize 17
    plt.ylabel('True Labels', fontsize = 17) # y-axis label with fontsize 17
    plt.show()

    print('Accuracy: {}'.format(accuracy_score(y_test,predictions)))

I tried to save the predictions like this:

  predictions[h] = rfc.predict(X_test)  

but then I get the error:

IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

I also tried using zip to save the predictions under the dataframe names:

names=['aa','bb','cc']

for h,n in (zip(dfs,names)):
...
    predictions[n] = rfc.predict(X_test)

but got the same error.

My end goal is to save the predictions of each model so that I can create a ROC graph at the end.


Solution

  • In each iteration of the loop you assign a new NumPy array to the variable predictions:

    predictions = rfc.predict(X_test)
    

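    That array cannot be indexed with a string key such as 'aa', which is why predictions[n] raises the IndexError. To keep each model's predictions, store them in a container such as a dict, declared under a different name outside the loop: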

    all_predictions = dict()
    

    Then modifying the code from your second example should work:

    names=['aa','bb','cc']
    
    for h,n in (zip(dfs,names)):
    ...
        all_predictions[n] = rfc.predict(X_test)
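
Since the end goal is a ROC graph, note that ROC curves are normally built from class scores rather than hard labels, so it may help to store the output of predict_proba as well. The following is a minimal sketch of that idea, not the asker's actual pipeline. It assumes a binary target and uses synthetic data from make_classification as a stand-in for aa, bb and cc. The names datasets, all_probabilities, all_y_test and roc_auc are hypothetical and introduced only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the aa, bb and cc dataframes: three synthetic
# binary-classification datasets, used only to make the sketch runnable.
datasets = {
    name: make_classification(n_samples=300, n_features=8, random_state=seed)
    for seed, name in enumerate(['aa', 'bb', 'cc'])
}

all_predictions = {}    # hard class labels, e.g. for confusion matrices
all_probabilities = {}  # positive-class scores, used for the ROC curves
all_y_test = {}         # true labels of the matching test split

for name, (X, y) in datasets.items():
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    rfc = RandomForestClassifier(random_state=42)
    rfc.fit(X_train, y_train)

    all_predictions[name] = rfc.predict(X_test)
    # Column 1 of predict_proba holds the probability of rfc.classes_[1]
    all_probabilities[name] = rfc.predict_proba(X_test)[:, 1]
    all_y_test[name] = y_test

# After the loop, each model's ROC curve can be computed from the stored values
roc_auc = {}
for name in datasets:
    fpr, tpr, _ = roc_curve(all_y_test[name], all_probabilities[name])
    roc_auc[name] = auc(fpr, tpr)
    print('{}: AUC = {:.3f}'.format(name, roc_auc[name]))
```

Keeping y_test next to the scores matters because each dataframe gets its own train/test split, so every ROC curve needs the true labels from the same split as its predictions. The fpr and tpr arrays can then be passed to plt.plot to draw all three curves on one figure.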