I would like to combine the model/feature coefficients from multiple (random) test-train splits into a single dataframe in Python.
Currently, my approach is to generate the model coefficients for each test-train split one at a time and then combine them at the end of the code.
While this works, it is excessively verbose and does not scale to a large number of test-train splits.
Can somebody simplify my approach, perhaps with a simple for loop? My inelegant, excessively verbose code follows below:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
####Instantiate logistic regression objects
log = LogisticRegression(class_weight='balanced', random_state = 1)
#### import some data
iris = datasets.load_iris()
X = pd.DataFrame(iris.data[:100, :], columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"])
y = iris.target[:100,]
#####test_train split #1
train_x, test_x, train_y, test_y = train_test_split(X,y, stratify=y, test_size=0.3, random_state=11)
log.fit(train_x, train_y) #fit final model
pred_y = log.predict(test_x) #store final model predictions
probs_y = log.predict_proba(test_x) #final model class probabilities
coeff_final1 = pd.concat([pd.DataFrame(X.columns),pd.DataFrame(np.transpose(log.coef_))], axis = 1)
coeff_final1.columns=("features", "coefficients_1")
######test_train split #2
train_x, test_x, train_y, test_y = train_test_split(X,y, stratify=y, test_size=0.3, random_state=444)
log.fit(train_x, train_y) #fit final model
pred_y = log.predict(test_x) #store final model predictions
probs_y = log.predict_proba(test_x) #final model class probabilities
coeff_final2 = pd.concat([pd.DataFrame(X.columns),pd.DataFrame(np.transpose(log.coef_))], axis = 1)
coeff_final2.columns=("features", "coefficients_2")
#####test_train split #3
train_x, test_x, train_y, test_y = train_test_split(X,y, stratify=y, test_size=0.3, random_state=21)
log.fit(train_x, train_y) #fit final model
pred_y = log.predict(test_x) #store final model predictions
probs_y = log.predict_proba(test_x) #final model class probabilities
coeff_final3 = pd.concat([pd.DataFrame(X.columns),pd.DataFrame(np.transpose(log.coef_))], axis = 1)
coeff_final3.columns=("features", "coefficients_3")
#####test_train split #4
train_x, test_x, train_y, test_y = train_test_split(X,y, stratify=y, test_size=0.3, random_state=109)
log.fit(train_x, train_y) #fit final model
pred_y = log.predict(test_x) #store final model predictions
probs_y = log.predict_proba(test_x) #final model class probabilities
coeff_final4 = pd.concat([pd.DataFrame(X.columns),pd.DataFrame(np.transpose(log.coef_))], axis = 1)
coeff_final4.columns=("features", "coefficients_4")
#####test_train split #5
train_x, test_x, train_y, test_y = train_test_split(X,y, stratify=y, test_size=0.3, random_state=1900)
log.fit(train_x, train_y) #fit final model
pred_y = log.predict(test_x) #store final model predictions
probs_y = log.predict_proba(test_x) #final model class probabilities
coeff_final5 = pd.concat([pd.DataFrame(X.columns),pd.DataFrame(np.transpose(log.coef_))], axis = 1)
coeff_final5.columns=("features", "coefficients_5")
#######Combine features/coefficients across 5 test-train splits
#append all coefficients into a single dataframe
coeff_table = pd.concat([coeff_final1, coeff_final2["coefficients_2"], coeff_final3["coefficients_3"],coeff_final4["coefficients_4"], coeff_final5["coefficients_5"] ], axis = 1)
#append mean and standard error for each coefficient
coeff_table["mean_coeff"] = coeff_table.mean(axis = 1)
coeff_table["se_coeff"] = coeff_table[["coefficients_1", "coefficients_2", "coefficients_3", "coefficients_4", "coefficients_5"]].sem(axis=1)
The final table looks as follows:
Can somebody show me how to generate the above table without writing all the lines of code above from test-train splits # 2 to test-train splits #5?
As you mentioned, you can do this with a for loop:
# start by creating the first features column
coeff_table = pd.DataFrame(X.columns, columns=["features"])
# iterate over random states while keeping track of `i`
for i, state in enumerate([11, 444, 21, 109, 1900]):
    train_x, test_x, train_y, test_y = train_test_split(
        X, y, stratify=y, test_size=0.3, random_state=state)
    log.fit(train_x, train_y)  # fit the model on this split
    # log.coef_ has shape (1, n_features); flatten it to a 1-D column
    coeff_table[f"coefficients_{i+1}"] = log.coef_.ravel()
Note that we drop the predict and predict_proba calls in this loop, since those values were thrown away (overwritten on each split) in your original code; you can add them back using similar logic inside the loop to create new columns in your table.
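Putting the pieces together, here is a complete, self-contained sketch that runs the loop and then appends the mean and standard-error columns, using the same five random states from the question (the column names mirror the originals):

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

log = LogisticRegression(class_weight='balanced', random_state=1)

# first 100 iris rows give a binary problem (setosa vs. versicolor)
iris = datasets.load_iris()
X = pd.DataFrame(iris.data[:100, :],
                 columns=["sepal_length", "sepal_width",
                          "petal_length", "petal_width"])
y = iris.target[:100]

coeff_table = pd.DataFrame({"features": X.columns})
for i, state in enumerate([11, 444, 21, 109, 1900], start=1):
    train_x, test_x, train_y, test_y = train_test_split(
        X, y, stratify=y, test_size=0.3, random_state=state)
    log.fit(train_x, train_y)
    # log.coef_ has shape (1, n_features) for binary classification
    coeff_table[f"coefficients_{i}"] = log.coef_.ravel()

# compute summary statistics over the coefficient columns only,
# so the string "features" column is excluded
coef_cols = [c for c in coeff_table.columns if c.startswith("coefficients_")]
coeff_table["mean_coeff"] = coeff_table[coef_cols].mean(axis=1)
coeff_table["se_coeff"] = coeff_table[coef_cols].sem(axis=1)
```

Selecting the coefficient columns by prefix means the summary-statistics lines keep working unchanged if you add more random states to the list.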