python, pandas, numpy, scikit-learn, feature-selection

RFECV with a pipeline containing ColumnTransformer


My question refers to a problem which has been raised in the following similar unanswered question: Using a Pipeline containing ColumnTransformer in SciKit's RFECV

I am trying to select the most relevant features with RFECV, using a pipeline that contains a ColumnTransformer, with the following code:

from sklearn.feature_selection import RFECV
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PowerTransformer
from sklearn.compose import ColumnTransformer
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import Pipeline
from sklearn.linear_model import HuberRegressor as hr

ind_nums_area = list(extra_enc_area_tr_data.columns.get_indexer(nums_area))

huber_prep_pipe = Pipeline([
    ('scaler', StandardScaler())
])
huber_col_transf = ColumnTransformer([
    ('prep', huber_prep_pipe, ind_nums_area)
], remainder='passthrough')

huber_pipe = Pipeline([
    ('transf', huber_col_transf),
    ('est', hr())
])
huber_ttr = TransformedTargetRegressor(regressor=huber_pipe, transformer=MinMaxScaler())
min_features = 100
huber_pipe_rfecv = RFECV(huber_ttr, min_features_to_select=min_features,
                         cv=5, scoring='neg_root_mean_squared_error',
                         n_jobs=-1, verbose=3,
                         importance_getter='regressor_.named_steps.est.coef_')
huber_pipe_rfecv.fit(extra_enc_area_tr_data, log_tr_target)

ind_nums_area is a list of the indices of the features to be transformed by the ColumnTransformer (the list of feature names is the nums_area variable). I am using indices because RFECV converts the training dataset, a pandas DataFrame, to a numpy array, so column names cannot be used. Unfortunately, indices are not the way to go either: RFECV reduces the number of features, so the listed indices no longer point to the right columns. One of the features to be transformed by the ColumnTransformer is index 255, the last feature of the training data, and in that case I get the following errors:

IndexError: index 255 is out of bounds for axis 0 with size 255
ValueError: all features must be in [0, 254] or [-255, 0]
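
To illustrate the mechanism, here is a minimal sketch of what I believe happens inside RFECV once it has dropped a column (the column counts are made up): the ColumnTransformer still refers to position 255, which no longer exists in the reduced array.

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

X = np.random.rand(10, 256)            # original data: 256 columns
ct = ColumnTransformer([
    ('prep', StandardScaler(), [255])  # refers to the last column by position
], remainder='passthrough')

X_reduced = X[:, :255]                 # RFECV has eliminated one column
ct.fit_transform(X_reduced)            # fails: index 255 is out of bounds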

Do you have any suggestions on how to solve this issue? Is there a simple way to find out which feature has been removed in each iteration of RFECV and adjust the indices accordingly? Maybe you have a suggestion on how to keep RFECV from converting the DataFrame to a numpy array? Otherwise, do you know an alternative to sklearn's RFECV that does not convert the training DataFrame to a numpy array?

I do not want to transform all the data before passing it to the estimator, as this would leak the scaler information into the test folds.

How can I deal with this without data leakage?


Solution

  • I had been trying to solve this problem in a similar setting for some time until I got fed up with scikit-learn's complicated internal conversions and decided to write my own rfecv() that works with pipelined transformers.

    Basically I implemented the algorithm from Guyon, Isabelle, et al. "Gene selection for cancer classification using support vector machines." Machine Learning 46.1 (2002): 389-422, which is also the basis for the scikit-learn implementation.

    from time import time

    import numpy as np
    from sklearn.model_selection import KFold, cross_validate
    from sklearn.pipeline import Pipeline


    def rfecv(X, y, estimator,
              min_features_to_select=3, 
              splits=3,
              step=3,
              scoring_metric="f1",
              scoring_decimals=3,
              random_state=None):
        """
        This method is an implementation of the recursive feature elimination 
        algorithm, which eliminates unnecessary features. As scikit-learn only provides an RFECV 
        version [1] that makes using Pipelines very difficult, we have implemented our 
        own version based on the original paper [2].
        
        [1] https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html
        [2] Guyon, Isabelle, et al. "Gene selection for cancer classification using support vector machines." 
            Machine learning 46.1 (2002): 389-422.
    
        :X: a DataFrame containing the features.
        :y: a Series containing the labels.
        :estimator: a scikit-learn estimator or a Pipeline. If a pipeline is passed,
            the last element of the pipeline is assumed to be a classifier providing
            a feature_importances_ attribute.
        :min_features_to_select: the minimum number of features to evaluate.
        :splits: number of splits to use for cross-validation.
        :step: the amount of features to be reduced during each step.
        :scoring_metric: the scoring metric to use for evaluation (e.g., "f1" or 
            a callable implementing the sklearn scoring interface).
        :scoring_decimals: the scoring metric can be rounded to N decimals to avoid 
            the reduction from getting stuck with a larger number of features with
            very small score gains. Defaults to 3 digits. If None is passed, full
            scoring precision is used.
        :random_state: if not None, this is the seed for all RNGs used in this function.
            
        :returns: best_features, best_score, ranks, scores; best_features is a list
            of features, best_score is the mean score achieved with these features over the
            folds, ranks is the order of eliminated features (from most relevant to most irrelevant),
            scores is the list of mean scores for each step achieved during the feature 
            elimination across all folds.
        """
        # Initialize survivors and ranked list
        survivors = list(X.columns)
        ranks = []
        scores = []
        
        # While the survivor list is longer than min_features_to_select
        while len(survivors) >= min_features_to_select:
                    
            # Get only the surviving features
            X_tmp = X[survivors]
            
            # Train and get the scores, cross_validate clones 
            # the model internally, so this does not modify
            # the estimator passed to this function
            print("[%.2f] evaluating %i features ..." % (time(), len(X_tmp.columns)))
            cv_result = cross_validate(estimator, X_tmp, y,
                                       cv=KFold(n_splits=splits, 
                                                shuffle=True, 
                                                random_state=random_state),
                                       scoring=scoring_metric,
                                       return_estimator=True)
            
        # Append the mean performance over the folds to the scores list
            score = np.mean(cv_result["test_score"])
            if scoring_decimals is None:
                scores.append(score)
            else:
                scores.append(round(score, scoring_decimals))            
            print("[%.2f] ... score %f." % (time(), scores[-1]))
            
            # Get feature weights from the model fitted 
            # on the best fold and square the weights as described 
            # in the paper. If the estimator is a Pipeline,
            # we get the weights from the last element.
            best_estimator = cv_result["estimator"][np.argmax(cv_result["test_score"])]
            if isinstance(best_estimator, Pipeline):
                weights = best_estimator[-1].feature_importances_
            else:
                weights = best_estimator.feature_importances_
            weights = list(np.power(weights, 2))
                    
            # Remove step features (but respect min_features_to_select)
            for _ in range(max(min(step, len(survivors) - min_features_to_select), 1)):
                
                # Find the feature with the smallest ranking criterion
                # and update the ranks and survivors
                idx = np.argmin(weights)
                ranks.insert(0, survivors.pop(idx))
                weights.pop(idx)
                
        # Calculate the best set of surviving features
        ranks_reverse = list(reversed(ranks))
        last_max_idx = len(scores) - np.argmax(list(reversed(scores))) - 1
        removed_features = set(ranks_reverse[0:last_max_idx * step])
        best_features = [f for f in X.columns if f not in removed_features]
        
        # Return ranks and scores
        return best_features, max(scores), ranks, scores
    

    Everything you need to know is documented in the docstring. The only exception is how to interpret the returned ranks and scores lists: in the case of step being 1, the score at scores[i] was achieved by removing all the features in list(reversed(ranks))[0:i] (as ranks is the list of removed features, ordered from most relevant to most irrelevant).
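
    For example, with step=1 the feature subset whose score is scores[i] can be reconstructed from ranks with a small helper like this (illustrative only, not part of the function above):

    def features_at_step(columns, ranks, i):
        # The features removed before step i are the i least relevant ones,
        # i.e. the last i entries of ranks read in reverse (removal) order.
        removed = set(list(reversed(ranks))[:i])
        return [f for f in columns if f not in removed]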

    A minimal working example with a DecisionTreeClassifier is shown below (it of course also works with Pipelines and transformers, as long as the classifier is the last element of the Pipeline):

    Python 3.9.1 (default, Dec 11 2020, 14:32:07) 
    [GCC 7.3.0] :: Anaconda, Inc. on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from sklearn.datasets import load_breast_cancer
    >>> from sklearn.tree import DecisionTreeClassifier
    >>> test_data = load_breast_cancer(as_frame=True)
    >>> clf = DecisionTreeClassifier(random_state=0)
    >>> clf.fit(test_data.data, test_data.target)
    DecisionTreeClassifier(random_state=0)
    >>> best_features, best_score, _, _ = rfecv(test_data.data, test_data.target, clf, step=1, min_features_to_select=1, random_state=0)
    [1626774242.35] evaluating 30 features ...
    [1626774242.38] ... score 0.944000.
    [1626774242.38] evaluating 29 features ...
    [1626774242.42] ... score 0.938000.
    [1626774242.42] evaluating 28 features ...
    [1626774242.47] ... score 0.948000.
    [1626774242.47] evaluating 27 features ...
    [1626774242.50] ... score 0.934000.
    [1626774242.51] evaluating 26 features ...
    [1626774242.54] ... score 0.938000.
    [1626774242.54] evaluating 25 features ...
    [1626774242.58] ... score 0.939000.
    [1626774242.58] evaluating 24 features ...
    [1626774242.62] ... score 0.941000.
    [1626774242.62] evaluating 23 features ...
    [1626774242.65] ... score 0.944000.
    [1626774242.65] evaluating 22 features ...
    [1626774242.68] ... score 0.953000.
    [1626774242.68] evaluating 21 features ...
    [1626774242.70] ... score 0.940000.
    [1626774242.70] evaluating 20 features ...
    [1626774242.72] ... score 0.941000.
    [1626774242.72] evaluating 19 features ...
    [1626774242.75] ... score 0.943000.
    [1626774242.75] evaluating 18 features ...
    [1626774242.77] ... score 0.942000.
    [1626774242.77] evaluating 17 features ...
    [1626774242.79] ... score 0.944000.
    [1626774242.79] evaluating 16 features ...
    [1626774242.80] ... score 0.945000.
    [1626774242.80] evaluating 15 features ...
    [1626774242.82] ... score 0.935000.
    [1626774242.82] evaluating 14 features ...
    [1626774242.84] ... score 0.935000.
    [1626774242.84] evaluating 13 features ...
    [1626774242.86] ... score 0.947000.
    [1626774242.86] evaluating 12 features ...
    [1626774242.87] ... score 0.950000.
    [1626774242.87] evaluating 11 features ...
    [1626774242.89] ... score 0.950000.
    [1626774242.89] evaluating 10 features ...
    [1626774242.91] ... score 0.944000.
    [1626774242.91] evaluating 9 features ...
    [1626774242.92] ... score 0.948000.
    [1626774242.92] evaluating 8 features ...
    [1626774242.94] ... score 0.953000.
    [1626774242.94] evaluating 7 features ...
    [1626774242.95] ... score 0.953000.
    [1626774242.95] evaluating 6 features ...
    [1626774242.97] ... score 0.949000.
    [1626774242.97] evaluating 5 features ...
    [1626774242.98] ... score 0.951000.
    [1626774242.98] evaluating 4 features ...
    [1626774243.00] ... score 0.947000.
    [1626774243.00] evaluating 3 features ...
    [1626774243.01] ... score 0.950000.
    [1626774243.01] evaluating 2 features ...
    [1626774243.02] ... score 0.942000.
    [1626774243.02] evaluating 1 features ...
    [1626774243.03] ... score 0.911000.
    >>> print(best_features, best_score)
    ['area error', 'smoothness error', 'fractal dimension error', 'worst radius', 'worst texture', 'worst concavity', 'worst concave points'] 0.953
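
    Since rfecv() keeps X as a pandas DataFrame throughout, a ColumnTransformer inside the pipeline can select columns by name or dtype instead of by position, which sidesteps the index-shifting problem from the question. Below is an untested sketch of how that could look in a regression setting similar to yours (X_train and y_train are placeholders, and I use a RandomForestRegressor because the last pipeline step must expose feature_importances_, which HuberRegressor does not):

    from sklearn.compose import ColumnTransformer, make_column_selector
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Scale only the numeric columns; selecting them by dtype keeps working
    # no matter which columns rfecv() has already eliminated.
    pipe = Pipeline([
        ('transf', ColumnTransformer([
            ('prep', StandardScaler(), make_column_selector(dtype_include='number'))
        ], remainder='passthrough')),
        ('est', RandomForestRegressor(random_state=0))
    ])

    best_features, best_score, ranks, scores = rfecv(
        X_train, y_train, pipe,
        min_features_to_select=100,
        step=1,
        scoring_metric='neg_root_mean_squared_error',
        random_state=0)

    Because cross_validate refits the whole pipeline on each training fold, the scaler only ever sees that fold's training data, so there is no leakage into the validation folds.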
    

    Regards,

    Matthias