python pandas dataframe sklearn-pandas

Collate model coefficients across multiple test-train splits from sklearn


I would like to combine the model/feature coefficients from multiple (random) test-train splits into a single dataframe in Python.

Currently, my approach is to generate the model coefficients for each test-train split one at a time and then combine them at the end of the code.

While this works, it is excessively verbose and not feasible to extend to a very large number of test-train splits.

Can somebody simplify my approach, perhaps with a simple for loop? My inelegant, excessively verbose code follows below:

import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


####Instantiate logistic regression objects
log = LogisticRegression(class_weight='balanced', random_state = 1)

#### import some data 
iris = datasets.load_iris()

X = pd.DataFrame(iris.data[:100, :], columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"])
y = iris.target[:100,]

#####test_train split #1
train_x, test_x, train_y, test_y = train_test_split(X,y, stratify=y, test_size=0.3, random_state=11)
log.fit(train_x, train_y) #fit final model 

pred_y = log.predict(test_x) #store final model predictions 
probs_y = log.predict_proba(test_x) #final model class probabilities

coeff_final1 = pd.concat([pd.DataFrame(X.columns),pd.DataFrame(np.transpose(log.coef_))], axis = 1)
coeff_final1.columns=("features", "coefficients_1")

######test_train split #2
train_x, test_x, train_y, test_y = train_test_split(X,y, stratify=y, test_size=0.3, random_state=444)
log.fit(train_x, train_y) #fit final model 

pred_y = log.predict(test_x) #store final model predictions 
probs_y = log.predict_proba(test_x) #final model class probabilities

coeff_final2 = pd.concat([pd.DataFrame(X.columns),pd.DataFrame(np.transpose(log.coef_))], axis = 1)
coeff_final2.columns=("features", "coefficients_2")

#####test_train split #3
train_x, test_x, train_y, test_y = train_test_split(X,y, stratify=y, test_size=0.3, random_state=21)
log.fit(train_x, train_y) #fit final model 

pred_y = log.predict(test_x) #store final model predictions 
probs_y = log.predict_proba(test_x) #final model class probabilities

coeff_final3 = pd.concat([pd.DataFrame(X.columns),pd.DataFrame(np.transpose(log.coef_))], axis = 1)
coeff_final3.columns=("features", "coefficients_3")

#####test_train split #4
train_x, test_x, train_y, test_y = train_test_split(X,y, stratify=y, test_size=0.3, random_state=109)
log.fit(train_x, train_y) #fit final model 

pred_y = log.predict(test_x) #store final model predictions 
probs_y = log.predict_proba(test_x) #final model class probabilities

coeff_final4 = pd.concat([pd.DataFrame(X.columns),pd.DataFrame(np.transpose(log.coef_))], axis = 1)
coeff_final4.columns=("features", "coefficients_4")

#####test_train split #5
train_x, test_x, train_y, test_y = train_test_split(X,y, stratify=y, test_size=0.3, random_state=1900)
log.fit(train_x, train_y) #fit final model 

pred_y = log.predict(test_x) #store final model predictions 
probs_y = log.predict_proba(test_x) #final model class probabilities

coeff_final5 = pd.concat([pd.DataFrame(X.columns),pd.DataFrame(np.transpose(log.coef_))], axis = 1)
coeff_final5.columns=("features", "coefficients_5")

#######Append features/coefficients & odds ratios across 5 test-train splits

#append all coefficients into a single dataframe
coeff_table = pd.concat([coeff_final1, coeff_final2["coefficients_2"], coeff_final3["coefficients_3"],coeff_final4["coefficients_4"], coeff_final5["coefficients_5"] ], axis = 1)

#append mean and std error for each coefficient
coeff_cols = ["coefficients_1", "coefficients_2", "coefficients_3", "coefficients_4", "coefficients_5"]
coeff_table["mean_coeff"] = coeff_table[coeff_cols].mean(axis = 1) #numeric columns only; "features" holds strings

coeff_table["se_coeff"] = coeff_table[coeff_cols].sem(axis=1)

The final table looks as follows:

[Image: table with columns features, coefficients_1 to coefficients_5, mean_coeff, se_coeff]

Can somebody show me how to generate the above table without repeating the same block of code for each of test-train splits #2 to #5?


Solution

  • As you mentioned, you can do this with a for loop:

    # start by creating the first features column
    coeff_table = pd.DataFrame(X.columns, columns=["features"])
    
    # iterate over random states while keeping track of `i`
    for i, state in enumerate([11, 444, 21, 109, 1900]):
        train_x, test_x, train_y, test_y = train_test_split(
            X, y, stratify=y, test_size=0.3, random_state=state)
        log.fit(train_x, train_y) #fit final model 
    
        coeff_table[f"coefficients_{i+1}"] = np.transpose(log.coef_)
    

    Note that this loop drops the predict and predict_proba calls, since in your code those values are thrown away (overwritten on every split). If you need them, you can add the calls back inside the loop and store the results per split using similar logic.
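    For completeness, here is a sketch of how the whole table from the question, including the mean and standard error columns, could be built around the same loop. It reuses the data and model settings from the question. The names `random_states`, `coeff_cols` and `predictions` are my own, not from the original code, and `log.coef_.ravel()` is used in place of `np.transpose(log.coef_)` on the assumption that the problem stays binary, so `coef_` has a single row.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same data as in the question: the first 100 iris rows (two classes).
iris = datasets.load_iris()
X = pd.DataFrame(
    iris.data[:100, :],
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
y = iris.target[:100]

log = LogisticRegression(class_weight="balanced", random_state=1)

random_states = [11, 444, 21, 109, 1900]
coeff_table = pd.DataFrame({"features": X.columns})
predictions = {}  # hypothetical container: random_state -> (pred_y, probs_y)

for i, state in enumerate(random_states, start=1):
    train_x, test_x, train_y, test_y = train_test_split(
        X, y, stratify=y, test_size=0.3, random_state=state
    )
    log.fit(train_x, train_y)

    # For a binary problem coef_ has shape (1, n_features); flatten it to 1-D.
    coeff_table[f"coefficients_{i}"] = log.coef_.ravel()

    # Keep the per-split predictions instead of overwriting them.
    predictions[state] = (log.predict(test_x), log.predict_proba(test_x))

# Summarise over the coefficient columns only, so the string column
# "features" is never passed to mean() or sem().
coeff_cols = [c for c in coeff_table.columns if c.startswith("coefficients_")]
coeff_table["mean_coeff"] = coeff_table[coeff_cols].mean(axis=1)
coeff_table["se_coeff"] = coeff_table[coeff_cols].sem(axis=1)

print(coeff_table)
```

    To use more splits, only `random_states` needs to change. The predictions are kept in a dict rather than in `coeff_table` because they have one entry per test sample, not one per feature.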