Tags: python, loops, for-loop, machine-learning, feature-selection

How to write a loop for feature selection by feature importance that, in each iteration, deletes features with importance = 0 or below the mean importance in Python?


I have a pandas DataFrame in Python like the one below:

Input data:

  • Y - binary target
  • X1...X5 - predictors

Code to create the DataFrame:

import pandas as pd
import numpy as np

from xgboost import XGBClassifier

df = pd.DataFrame()
df["Y"] = [1,0,1,0]
df["X1"] = [111,12,150,270]
df["X2"] = [22,33,44,55]
df["X3"] = [1,1,0,0]
df["X4"] = [0,0,0,1]
df["X5"] = [150, 222,230,500]

Y   | X1  | X2  | X3    | X4    | X5  | ...  | Xn
----|-----|-----|-------|-------|-----|------|-------
1   | 111 | 22  | 1     | 0     | 150 | ...  | ...
0   | 12  | 33  | 1     | 0     | 222 | ...  | ...
1   | 150 | 44  | 0     | 0     | 230 | ...  | ...
0   | 270 | 55  | 0     | 1     | 500 | ...  | ...

I select features by deleting those with importance = 0 in each iteration; if there are no features with importance = 0, I delete the features whose importance is below the mean importance in that iteration:

First iteration:

model_importance = XGBClassifier()
model_importance.fit(X = df.drop(labels=["Y"], axis=1), y = df["Y"])

importances = pd.DataFrame({"Feature":df.drop(labels=["Y"], axis=1).columns,
                            "Importance":model_importance.feature_importances_})

# Collect the feature names (not the row indices) of zero-importance features
importances_to_drop_1 = importances.loc[importances["Importance"]==0, "Feature"].tolist()

df.drop(columns = importances_to_drop_1, inplace = True)

Second iteration:

model_importance_2 = XGBClassifier()
model_importance_2.fit(X = df.drop(labels=["Y"], axis=1), y = df["Y"])

importances_2 = pd.DataFrame({"Feature":df.drop(labels=["Y"], axis=1).columns,
                            "Importance":model_importance_2.feature_importances_})

# Collect the feature names (not the row indices) of below-mean features
importances_to_drop_2 = importances_2.loc[importances_2["Importance"]<importances_2["Importance"].mean(), "Feature"].tolist()

df.drop(columns = importances_to_drop_2, inplace = True)

Requirements:

  • I need a loop where each iteration deletes the features with importance = 0 or, if no feature has importance = 0 in that iteration, deletes the features with importance below the mean importance in that iteration
  • At the end I need to have at least 150 features
  • I need this in one loop (one segment of code), not in several separate segments as it is now

How can I do that in Python?


Solution

  • Add a for loop that runs a set number of times, and use a conditional to choose between method 1 (drop features with importance = 0) and method 2 (drop features below the mean importance), depending on whether any feature has importance = 0.

    iterations = 20
    min_features = 150
    for i in range(iterations):
        # Refit on the predictors that are still in df
        X = df.drop(columns=["Y"])
        model_importance = XGBClassifier()
        model_importance.fit(X, df["Y"])

        importances = pd.DataFrame({"Feature": X.columns,
                                    "Importance": model_importance.feature_importances_})

        # Method 1: feature names (not row indices) with zero importance
        to_drop = importances.loc[importances["Importance"] == 0, "Feature"].tolist()
        if not to_drop:
            # Method 2: features below the mean importance of this iteration
            mean_importance = importances["Importance"].mean()
            to_drop = importances.loc[importances["Importance"] < mean_importance, "Feature"].tolist()

        # Stop if nothing is left to drop or fewer than min_features would remain
        if not to_drop or X.shape[1] - len(to_drop) < min_features:
            break

        df.drop(columns=to_drop, inplace=True)
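The drop rule itself can also be separated from the model, which makes it easy to check on toy numbers before running it on real data. Below is a minimal sketch under some assumptions: `features_to_drop`, `select_features`, `importance_fn`, `TOY_SCORES` and `toy_importance` are hypothetical names introduced here for illustration, not part of XGBoost or pandas. In real use, `importance_fn` would presumably fit an `XGBClassifier` on the given columns and return its `feature_importances_` keyed by column name.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence


def features_to_drop(importances: Dict[str, float]) -> List[str]:
    """Return zero-importance features if any exist, else those below the mean."""
    zero = [name for name, imp in importances.items() if imp == 0]
    if zero:
        return zero
    avg = mean(importances.values())
    return [name for name, imp in importances.items() if imp < avg]


def select_features(
    features: Sequence[str],
    importance_fn: Callable[[List[str]], Dict[str, float]],
    min_features: int = 150,
    max_iterations: int = 20,
) -> List[str]:
    """Repeatedly drop features, keeping at least min_features of them."""
    kept = list(features)
    for _ in range(max_iterations):
        # importance_fn is a hypothetical callable: column names -> importances
        to_drop = set(features_to_drop(importance_fn(kept)))
        # Stop when nothing can be dropped or the drop would go below the floor
        if not to_drop or len(kept) - len(to_drop) < min_features:
            break
        kept = [name for name in kept if name not in to_drop]
    return kept


# Toy stand-in for a fitted model: fixed, made-up importances per feature
TOY_SCORES = {"X1": 0.5, "X2": 0.0, "X3": 0.3, "X4": 0.0, "X5": 0.2}


def toy_importance(columns: List[str]) -> Dict[str, float]:
    return {name: TOY_SCORES[name] for name in columns}


selected = select_features(list(TOY_SCORES), toy_importance, min_features=2)
print(selected)
```

With the toy scores, the first pass removes the two zero-importance features, and the second pass stops because dropping the below-mean features would leave fewer than `min_features`. Keeping the rule in a plain function also means the 150-feature floor is a parameter rather than a number repeated inside the loop.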