python machine-learning scikit-learn pipeline data-transform

Is there a way to do transformation on features X based on true labels in y?


I have checked other questions on this topic, such as this, this, this, this and this, as well as some great blog posts (blog1, blog2 and blog3; kudos to the respective authors), but without success.

I want to transform the values in X that fall below a certain threshold, but only in rows that belong to specific classes of the target y (y != 9). The threshold is computed per column from the rows of the other class (y == 9). However, I am having trouble understanding how to implement this properly.

Because I want to do parameter tuning and cross-validation, the transformation has to happen inside a pipeline. My custom transformer class is shown below. Note that I haven't included TransformerMixin, as I believe I need to account for y in the fit_transform() function.

class CustomTransformer(BaseEstimator):

    def __init__(self, percentile=.90):
        self.percentile = percentile

    def fit(self, X, y):
        # Calculate thresholds for each column
        thresholds = X.loc[y == 9, :].quantile(q=self.percentile, interpolation='linear').to_dict()

        # Store them for later use
        self.thresholds = thresholds
        return self

    def transform(self, X, y):
        # Create a copy of X
        X_ = X.copy(deep=True)

        # Replace values lower than the threshold for each column
        for p in self.thresholds:
            X_.loc[y != 9, p] = X_.loc[y != 9, p].apply(lambda x: 0 if x < self.thresholds[p] else x)
        return X_

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X, y)

This is then fed into a pipeline and, in turn, into GridSearchCV. A reproducible example follows.

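import random
from random import randint, sample, shuffle

import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline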

# Create some example data to work with
random.seed(12)
target = [randint(1, 8) for _ in range(60)] + [9]*40
shuffle(target)
example = pd.DataFrame({'feat1': sample(range(50, 200), 100), 
                       'feat2': sample(range(10, 160), 100),
                       'target': target})
example_x = example[['feat1', 'feat2']]
example_y = example['target']

# Create a final nested pipeline where the data pre-processing steps and the final estimator are included
pipeline = Pipeline(steps=[('CustomTransformer', CustomTransformer(percentile=.90)),
                           ('estimator', RandomForestClassifier())])

# Parameter tuning with GridSearchCV
p_grid = {'estimator__n_estimators': [50, 100, 200]}
gs = GridSearchCV(pipeline, p_grid, cv=10, n_jobs=-1, verbose=3)
gs.fit(example_x, example_y)

The code above gives me the following error:

/opt/anaconda3/envs/Python37/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

TypeError: transform() missing 1 required positional argument: 'y'
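
For context, the TypeError is consistent with how Pipeline drives its intermediate steps: y is passed when a step is fitted, but transform() is called with X only, for example when a held-out fold is scored. A minimal sketch that records those calls (RecordingStep and the toy data are hypothetical, introduced only for illustration):

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

calls = []  # records how the pipeline invokes the step


class RecordingStep(BaseEstimator):
    """Hypothetical pass-through step that only logs its calls."""

    def fit(self, X, y=None):
        calls.append("fit(X, y)" if y is not None else "fit(X)")
        return self

    def transform(self, X):
        calls.append("transform(X)")
        return X


# Toy data, assumed for illustration only
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline([("rec", RecordingStep()), ("clf", LogisticRegression())])
pipe.fit(X, y)     # the step is fitted with y, then asked to transform X
pipe.predict(X)    # only transform(X) is called; no y is passed here

print(calls)
```

A transform(X, y) signature therefore cannot be satisfied at predict or score time, whichever way fit_transform() is written.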


I have also tried other approaches, such as storing the corresponding class indices during fit() and then using them during transform(). However, because the train and test indices differ during cross-validation, this raises an index error when the values are replaced in transform().

So, is there a clever way to solve this?


Solution

  • This is what I was suggesting in the comments: append y to X as its last column, so that transform() can recover it from X itself:

    class CustomTransformer(BaseEstimator):
    
        def __init__(self, percentile=.90):
            self.percentile = percentile
    
        def fit(self, X, y):
            # Calculate thresholds for each column
    
            # We have appended y as last column in X, so remove that
            X_ = X.iloc[:,:-1].copy(deep=True)
    
            thresholds = X_.loc[y == 9, :].quantile(q=self.percentile, interpolation='linear').to_dict()
    
            # Store them for later use
            self.thresholds = thresholds
            return self
    
        def transform(self, X):
            # Create a copy of actual X, except the targets which are appended
    
            # We have appended y as last column in X, so remove that
            X_ = X.iloc[:,:-1].copy(deep=True)
    
            # Use that here to get y
            y = X.iloc[:, -1].copy(deep=True)
    
            # Replace values lower than the threshold for each column
            for p in self.thresholds:
                X_.loc[y != 9, p] = X_.loc[y != 9, p].apply(lambda x: 0 if x < self.thresholds[p] else x)
            return X_
    
        def fit_transform(self, X, y):
            return self.fit(X, y).transform(X)
    

    Then change your X and y so that the target is also included in X:

    # Append the target to X as its last column
    example_x = example[['feat1', 'feat2', 'target']]
    example_y = example['target']
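
To see the pieces working together, here is a hedged end-to-end sketch. It uses a condensed, vectorised variant of the transformer above (not the exact code from the answer): labels are read from the last column in both fit() and transform(), and the fitted thresholds are stored as thresholds_. The smaller grid, cv=3 and random_state=0 are assumptions chosen only to keep the run short and repeatable.

```python
import random

import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class CustomTransformer(BaseEstimator):
    """Condensed variant: the target rides along as the last column of X."""

    def __init__(self, percentile=.90):
        self.percentile = percentile

    def fit(self, X, y=None):
        features, labels = X.iloc[:, :-1], X.iloc[:, -1]
        # Per-column thresholds, computed from the class-9 rows only
        self.thresholds_ = features.loc[labels == 9].quantile(q=self.percentile)
        return self

    def transform(self, X):
        features, labels = X.iloc[:, :-1], X.iloc[:, -1]
        # True where a value lies below its column threshold ...
        below = features.lt(self.thresholds_, axis="columns")
        # ... but class-9 rows are never modified
        below.loc[labels == 9, :] = False
        # Returns a new frame without the target column
        return features.mask(below, 0)


# Same example data as in the question, with the target kept inside X
random.seed(12)
target = [random.randint(1, 8) for _ in range(60)] + [9] * 40
random.shuffle(target)
example = pd.DataFrame({"feat1": random.sample(range(50, 200), 100),
                        "feat2": random.sample(range(10, 160), 100),
                        "target": target})
example_x = example[["feat1", "feat2", "target"]]
example_y = example["target"]

pipeline = Pipeline(steps=[("CustomTransformer", CustomTransformer(percentile=.90)),
                           ("estimator", RandomForestClassifier(random_state=0))])
gs = GridSearchCV(pipeline, {"estimator__n_estimators": [10, 20]}, cv=3)
gs.fit(example_x, example_y)
print(gs.best_params_)
```

Because the transformer reads the labels from X, the target column has to be present in whatever is later passed to predict() or score() as well. The approach therefore suits cross-validated experiments on labelled data rather than inference on unlabelled rows.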