python machine-learning scikit-learn artificial-intelligence xgboost

Is there any way I can import my own feature_importances into a model?


I was wondering whether I can import feature_importances from, say, model1 into model2, so that I can train model2 starting from these feature_importances and let model2 influence them to create a new set of "mutated" feature_importances.

Thanks in advance

I tried something like model2.feature_importances_ = model1.feature_importances_, but it threw an error: AttributeError: can't set attribute 'feature_importances_'. That was to be expected.


Solution

  • I think you are confusing the feature_importances_ property with the feature_weights parameter in xgboost.


    Feature importances are a post-hoc statistic, not a model parameter

    Feature importance is a statistic computed after training, and it can be calculated in different ways, such as mean decrease in impurity or permutation importance. The goal of these statistics is to give a relative sense of which features were better predictors for the task at hand.

    A popular example of a feature_importances_ statistic is the Gini importance.

    The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance. Details here.

    feature_importances_ is a property that is calculated and exposed on the model instance after model.fit() has been called. You cannot overwrite this attribute, and even if you could, doing so would not change how the model trains in the way you are expecting.
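
    To make the error from the question concrete, here is a minimal sketch of a read-only property. ToyModel and its _gains field are hypothetical stand-ins invented for this illustration; they are not the actual xgboost or scikit-learn implementation, only the same Python mechanism (a property without a setter that is derived from fitted state).

```python
class ToyModel:
    """Hypothetical stand-in for a fitted tree model (not a real library class)."""

    def __init__(self):
        self._gains = None  # raw per-feature criterion reductions, set by fit()

    def fit(self, gains):
        # A real model would accumulate these while growing trees;
        # here they are simply handed in to keep the sketch short.
        self._gains = list(gains)
        return self

    @property
    def feature_importances_(self):
        # Derived on demand from the fitted state and normalized to sum to 1.
        total = sum(self._gains)
        return [g / total for g in self._gains]


model1 = ToyModel().fit([2.0, 1.0, 6.0, 1.0])
model2 = ToyModel().fit([1.0, 1.0, 1.0, 1.0])

try:
    model2.feature_importances_ = model1.feature_importances_
    assignment_failed = False
except AttributeError:
    # No setter is defined for the property, so Python rejects the assignment.
    assignment_failed = True

print(assignment_failed)
print(model1.feature_importances_)
```

    Because the values are recomputed from what the model learned, writing to them (even if it were allowed) would not feed anything back into training.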


    Feature weights can help "prioritize" a given feature during training

    What you need is the feature_weights parameter, which is part of both xgboost.XGBRegressor.fit and xgboost.XGBClassifier.fit.

    feature_weights (Optional[Any]) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise, a ValueError is thrown.

    Read more details on feature_weights here. Make sure you set one of the colsample_* parameters when instantiating the model, and then pass an array with one feature_weights entry per feature to the .fit() method. By default, every feature weight is set to 1.
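
    As a rough intuition for what "probability of each feature being selected" means, the sketch below draws columns without replacement in proportion to their weights. It is only an illustration of weighted sampling, not xgboost's internal sampler; sample_columns, k=2, and n_rounds are hypothetical choices made for this example.

```python
import random


def sample_columns(weights, k, rng):
    """Draw k distinct column indices, each draw proportional to its weight."""
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(k):
        w = [weights[i] for i in remaining]
        pick = rng.choices(remaining, weights=w, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


rng = random.Random(0)
feature_weights = [0.5, 0.3, 0.8, 0.1]
counts = [0, 0, 0, 0]
n_rounds = 10_000  # hypothetical number of sampling rounds

for _ in range(n_rounds):
    for col in sample_columns(feature_weights, k=2, rng=rng):
        counts[col] += 1

# Fraction of rounds in which each column was part of the sampled subset.
selection_rate = [c / n_rounds for c in counts]
print(selection_rate)
```

    Columns with larger weights end up in the sampled subset more often, which is why the weights only matter when column subsampling is switched on.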

    Here is an example of the usage of these parameters.

    import numpy as np
    from xgboost import XGBClassifier
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    # Load data
    data = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(data['data'],
                                                        data['target'],
                                                        test_size=.2)

    # Train model; iris has three classes, so the objective is left at its
    # default instead of 'binary:logistic'
    xgb = XGBClassifier(n_estimators=5,
                        colsample_bytree=0.7)                   #<--------

    # One weight per feature (iris has four features)
    feature_weights = np.array([0.5, 0.3, 0.8, 0.1])            #<--------
    xgb.fit(X_train, y_train, feature_weights=feature_weights)  #<--------

    preds = xgb.predict(X_test)
    xgb.feature_importances_

    # Example output (values vary with the random train/test split)
    array([0.09477486, 0.03003547, 0.77826285, 0.09692684], dtype=float32)
    

    Can you use the feature_importances_ from a previous model as feature_weights?

    As long as you understand ...

    1. how the colsample_* parameters use these feature_weights for sampling/prioritizing specific columns,
    2. that feature_importances_ are not probabilities, whereas feature_weights are used as (unnormalized) sampling probabilities: they don't need to sum to 1 but must be > 0,

    ... I don't see why not.

    You can pass the feature_importances_ from, say, a previously trained random forest model as feature priorities for the xgboost model using the colsample_* and feature_weights parameters. After training, you can pull your new "mutated" feature_importances_. Be careful when comparing these feature_importances_ across models, though, since different models can compute them in different ways.