python machine-learning scikit-learn data-science transformer-model

Custom Transformer in sklearn


I am building a transformer in sklearn that drops features whose correlation coefficient with the response label is lower than a specified threshold.

It works on the training set. However, when I transform the test set, all of its features disappear. I assume the transformer is calculating correlations between the test data and the training labels, and since those are all low, it drops every feature. How do I make it calculate the correlations on the training set only, and then drop the same features from the test set in transform?

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class CorrelatedFeatures(BaseEstimator, TransformerMixin):
    # Selects only features that have a correlation coefficient higher than threshold with the response label
    def __init__(self, response, threshold=0.1):
        self.threshold = threshold
        self.response = response
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        # Correlations are recomputed against self.response on every call, including for the test set
        df = pd.concat([X, self.response], axis=1)
        cols = df.columns[abs(df.corr()[df.columns[-1]]) > self.threshold].drop(self.response.columns)
        return X[cols]

Solution

  • Compute the correlations in fit() and store the columns to keep there; transform() then only has to select those stored columns.

    Something like this:

    ....
    ....
    
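    def fit(self, X, y=None):
        # Compute the correlations once, on the data passed to fit() (the training set)
        df = pd.concat([X, self.response], axis=1)
        # Remember which feature columns pass the threshold
        self.cols = df.columns[abs(df.corr()[df.columns[-1]]) > self.threshold].drop(self.response.columns)
        return self
    def transform(self, X, y=None):
        # Select the columns chosen in fit(); nothing is recomputed here
        return X[self.cols]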
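
For completeness, here is a minimal end-to-end sketch of the same idea. It makes one change that is an assumption and not part of the answer above: the label is passed to fit() as y instead of to the constructor, and it is aligned to X by position so that differing row indexes cannot produce NaN correlations. The toy data, the cols_ attribute name, and the use of DataFrame.corrwith are illustrative choices, not the original author's code.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class CorrelatedFeatures(BaseEstimator, TransformerMixin):
    """Keep features whose absolute correlation with y exceeds threshold."""

    def __init__(self, threshold=0.1):
        self.threshold = threshold

    def fit(self, X, y):
        # Give y the same index as X so the correlation is computed row by row,
        # regardless of how the two objects were indexed originally.
        target = pd.Series(np.asarray(y).ravel(), index=X.index)
        # Absolute Pearson correlation of every column of X with the target.
        corr = X.corrwith(target).abs()
        # Learned state gets a trailing underscore, per scikit-learn convention.
        self.cols_ = corr[corr > self.threshold].index.tolist()
        return self

    def transform(self, X):
        # Reuse the columns chosen in fit(); nothing is recomputed here.
        return X[self.cols_]


# Toy training data (hypothetical): "signal" drives y, "flat" is uncorrelated with it.
X_train = pd.DataFrame({
    "signal": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    "flat": [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0],
})
y_train = 2.0 * X_train["signal"] + 1.0

# The test set deliberately uses a different row index than the training set.
X_test = pd.DataFrame(
    {"signal": [8.0, 9.0, 10.0, 11.0], "flat": [1.0, -1.0, 1.0, -1.0]},
    index=[100, 101, 102, 103],
)

selector = CorrelatedFeatures(threshold=0.1)
X_train_sel = selector.fit_transform(X_train, y_train)  # fits on training data only
X_test_sel = selector.transform(X_test)                 # applies the stored columns

print(list(X_test_sel.columns))  # ['signal']
```

Because the label arrives through fit(X, y), the transformer can also be used as a step in a Pipeline, which calls fit on the training data only and transform on anything that comes afterwards.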