python for-loop machine-learning roc auc

How to create several machine learning models in a loop over all variables, so that each successive XGBClassifier is fitted with one fewer variable, in Python?


I have a pandas DataFrame in Python like the one below:

Input data:

  • Y - binary target

  • X1...X5 - predictors

    Y    X1   X2   X3   X4   X5
    1    111  22   1    0    150
    0    12   33   1    0    222
    1    150  44   0    0    230
    0    270  55   0    1    500
    ...  ...  ...  ...  ...  ...

Requirements: I need to:

  • run a loop over all the variables so that each iteration creates a new XGBoost classification model, and after each iteration one of the variables is discarded before the next model is created
  • So, if I have for example 5 predictors (X1...X5), I need to create 5 XGBoost classification models, and each successive model must use 1 fewer variable
  • Each model should be evaluated with roc_auc_score
  • As output I need: list_of_models = [], where the created models are saved, and a DataFrame with the AUC on train and test
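The requirements do not say which variable is discarded at each step. As a minimal sketch, assuming the first remaining column is dropped each time (in column order), the predictor set used by each model could be listed like this. The name `subsets` is only illustrative.

```python
# Predictor names taken from the question's example DataFrame.
predictors = ["X1", "X2", "X3", "X4", "X5"]

# Assumption: drop the first remaining column at each step.
# subsets[0] holds all 5 predictors, subsets[4] holds only the last one.
subsets = [predictors[i:] for i in range(len(predictors))]

for model_position, cols in enumerate(subsets):
    # model_position corresponds to "Model", len(cols) to "Num_var"
    print(model_position, len(cols), cols)
```

Any other discard order (for example, least important variable first) would only change how `subsets` is built; the number of models stays equal to the number of predictors.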

Desired output:

So, as a result, I need something like the table below:

  • Model - position of model in list_of_models

  • Num_var - number of predictors used in model

  • AUC_train - roc_auc_score on train dataset

  • AUC_test - roc_auc_score on test dataset

    Model  Num_var  AUC_train  AUC_test
    0      5        0.887      0.884
    1      4        0.875      0.845
    2      3        0.854      0.843
    3      2        0.965      0.928
    4      1        0.922      0.921

My draft is below. It is wrong, because it should loop through the variables so that each iteration creates a new XGBoost classification model and then discards one variable before the next model is created:

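import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X_train, X_test, y_train, y_test = train_test_split(df.drop("Y", axis=1),
                                                    df.Y,
                                                    train_size=0.70,
                                                    test_size=0.30,
                                                    random_state=1,
                                                    stratify=df.Y)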

results = []
list_of_models = []

for val in X_train:

    model = XGBClassifier()
    model.fit(X_train, y_train)
    list_of_models.append(model)

    preds_train = model.predict(X_train)
    preds_test = model.predict(X_test)
    preds_prob_train = model.predict_proba(X_train)[:,1]
    preds_prob_test = model.predict_proba(X_test)[:,1]

    results.append({"AUC_train": round(metrics.roc_auc_score(y_train, preds_prob_train), 3),
                    "AUC_test": round(metrics.roc_auc_score(y_test, preds_prob_test), 3)})

results = pd.DataFrame(results)

How can I do that in Python?


Solution

  • If I understand correctly, you want the data to get narrower on each pass through the loop. In that case you could drop one column at the end of each iteration, something like this:

    results = []
    list_of_models = []

    # Iterate over the original column names; each pass fits a model on the
    # columns that are left, then drops one column before the next pass.
    for i in X_train.columns:
        model = XGBClassifier()
        model.fit(X_train, y_train)
        list_of_models.append(model)

        # roc_auc_score is computed from the positive-class probabilities
        preds_prob_train = model.predict_proba(X_train)[:, 1]
        preds_prob_test = model.predict_proba(X_test)[:, 1]
        results.append({"Model": len(list_of_models) - 1,
                        "Num_var": X_train.shape[1],
                        "AUC_train": round(metrics.roc_auc_score(y_train, preds_prob_train), 3),
                        "AUC_test": round(metrics.roc_auc_score(y_test, preds_prob_test), 3)})
        X_train = X_train.drop(i, axis=1)
        X_test = X_test.drop(i, axis=1)

    results = pd.DataFrame(results)