I was wondering whether I am able to import feature_importances from let's say model1 to model2, such that I can then train model2 starting from these feature_importances, and let model2 influence these feature_importances to create a new set of "Mutated" feature_importances.
Thanks in advance
I tried just doing something like this:
model2.feature_importances_ = model1.feature_importances_
but it just threw an error at me saying AttributeError: can't set attribute 'feature_importances_', which is to be expected.
I think you are confusing the feature_importances_ property with the feature_weights parameter in xgboost.
Feature importance is a post-hoc statistic that can be calculated in different ways, such as mean decrease in impurity or feature permutation. The goal of these statistics is to give a relative sense of which features were better predictors for the prediction task at hand.
An example of a popular feature_importances_ statistic is Gini importance.
The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance. Details here.
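To make the difference between the two kinds of statistic concrete, here is a minimal sketch that computes both on the iris data. It assumes scikit-learn is installed and uses a random forest purely for illustration; the variable names are my own and nothing here is specific to xgboost.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

iris = load_iris()
X, y = iris.data, iris.target

# Fit a small forest; importances only exist after fitting
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based (Gini) importance: normalized, so it sums to 1
impurity_importance = forest.feature_importances_

# Permutation importance: drop in score when one column is shuffled
perm = permutation_importance(forest, X, y, n_repeats=5, random_state=0)
permutation_scores = perm.importances_mean

for name, mdi, pi in zip(iris.feature_names, impurity_importance, permutation_scores):
    print(f"{name}: impurity={mdi:.3f} permutation={pi:.3f}")
```

The two columns are on different scales, which is one reason importances from different methods or models should not be compared directly.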
feature_importances_ is a property that is calculated from the fitted model and exposed on the model class instance after the model.fit() method is called. You can NOT overwrite this attribute, and even if you could, doing so would not change the model training in the way you are expecting.
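A quick way to convince yourself of this is the sketch below. It assumes scikit-learn and uses two random forests as stand-ins for model1 and model2 (the names come from the question, the choice of estimator is mine). The assignment fails because the attribute is a read-only property.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Two independently fitted models standing in for model1 and model2
model1 = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
model2 = RandomForestClassifier(n_estimators=10, random_state=1).fit(X, y)

# feature_importances_ is derived from the fitted trees, so it cannot be assigned
try:
    model2.feature_importances_ = model1.feature_importances_
    assignment_failed = False
except AttributeError as err:
    assignment_failed = True
    print("Assignment rejected:", err)
```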
What you need is the feature_weights parameter, which is part of xgboost.XGBRegressor.fit and xgboost.XGBClassifier.fit.
feature_weights (Optional[Any]) – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise, a ValueError is thrown.
Read more details on feature_weights here. You have to make sure you use one of the colsample_* parameters in the model instantiation and then pass your feature_weights, one value per feature, as an array to the .fit() method. By default, each feature weight is set to 1.
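To build some intuition for what "probability of each feature being selected" means, here is a conceptual sketch using only NumPy. It is not xgboost's actual implementation; it simply assumes that column sampling behaves like weighted sampling without replacement, and the colsample value and number of rounds are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-feature weights: strictly positive, need not sum to 1
feature_weights = np.array([0.5, 0.3, 0.8, 0.1])
colsample = 0.5  # fraction of columns drawn each round (illustrative value)

n_features = feature_weights.size
n_selected = max(1, int(colsample * n_features))

# Normalize the weights so they can be used as sampling probabilities
probabilities = feature_weights / feature_weights.sum()

# Repeat the draw many times and count how often each column is picked
counts = np.zeros(n_features, dtype=int)
n_rounds = 2000
for _ in range(n_rounds):
    chosen = rng.choice(n_features, size=n_selected, replace=False, p=probabilities)
    counts[chosen] += 1

selection_rate = counts / n_rounds
print(selection_rate)
```

Columns with larger weights are picked more often, but low-weight columns still get sampled occasionally, so the weights act as a prior rather than a hard filter.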
Here is an example of the usage of these parameters.
import numpy as np
from xgboost import XGBClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load data
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data['data'],
                                                    data['target'],
                                                    test_size=.2)
# Train model; a colsample_* parameter must be set for feature_weights to have an effect
xgb = XGBClassifier(n_estimators=5,
                    objective='multi:softprob',  # iris has three classes
                    colsample_bytree=0.7)  #<--------
# One strictly positive weight per feature (iris has four features)
feature_weights = np.array([0.5, 0.3, 0.8, 0.1])  #<--------
xgb.fit(X_train, y_train, feature_weights=feature_weights)  #<--------
preds = xgb.predict(X_test)
xgb.feature_importances_
array([0.09477486, 0.03003547, 0.77826285, 0.09692684], dtype=float32)
Can you use feature_importances_ from a previous model as feature_weights?
As long as you understand that feature_weights are used for sampling/prioritizing specific columns, that feature_importances_ are not probabilities, and that feature_weights act as probabilities (they don't need to sum up to 1 but have to be > 0), I don't see why not.
You can pass feature_importances_ from, say, a previously run random forest model as feature priorities for the xgboost model using the colsample_* and feature_weights parameters. After the model training, you can pull your new "mutated" feature_importances_. Be careful when comparing these feature_importances_ across models, though.