Tags: python, scikit-learn, neuraxle

Neuraxle Select Columns in Pandas DataFrame


What's the Neuraxle way to select a subset of columns from a dataset? This is how I am doing it via sklearn:

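import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnSelectTransformer(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns by name."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Stateless: there is nothing to learn from the data.
        return self

    def transform(self, X):
        # Wrap array-likes in a DataFrame so that label-based selection works.
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        return X[self.columns]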



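from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Set up SIMPLE FEATURES
simple_cols = ['BEDCERT', 'RESTOT', 'INHOSP', 'CCRC_FACIL',
               'SFF', 'CHOW_LAST_12MOS', 'SPRINKLER_STATUS',
               'EXP_TOTAL', 'ADJ_TOTAL']

# Select the columns first, then impute missing values in them.
simple_features = Pipeline([
    ('cst', ColumnSelectTransformer(simple_cols)),
    ('impute', SimpleImputer())
])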

EDIT:

I think this is one solution, but I'm not 100% convinced.

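import pandas as pd
from neuraxle.base import BaseTransformer, ForceHandleMixin


class ColumnSelectTransformer(BaseTransformer, ForceHandleMixin):

    def __init__(self, required_columns):
        BaseTransformer.__init__(self)
        ForceHandleMixin.__init__(self)
        self.required_columns = required_columns

    def inverse_transform(self, processed_outputs):
        # Left unimplemented: dropped columns cannot be recovered.
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Wrap array-likes in a DataFrame so that label-based selection works.
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        return X[self.required_columns]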

Solution

  • Update: this has since been fixed. See the usage example of the column transformer here: https://www.neuraxle.org/stable/examples/sklearn/plot_cyclical_feature_engineering.html#sphx-glr-examples-sklearn-plot-cyclical-feature-engineering-py


    There is already an issue for this: https://github.com/Neuraxio/Neuraxle/issues/168

    I would be tempted to not use Pandas for now, and instead use the provided ColumnTransformer: https://www.neuraxle.org/stable/api/neuraxle.steps.column_transformer.html

    If you get to fully code (and properly unit test) your Pandas Transformer, we'd be glad to have your contribution: open a pull request on Neuraxle and we will add you as a contributor.

    Until then, you could code a simple PandasToNumpy step that returns the .values in its transform method, and then use Neuraxle's existing ColumnTransformer, passing the integer indices of the desired columns instead of their string names.

    Also note that you can inherit from NonFittableMixin, which makes fit simply return self without any additional code.
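
The name-to-index lookup that the PandasToNumpy suggestion implies can be sketched with plain pandas and NumPy. This is only a rough illustration under stated assumptions: `pandas_to_numpy` and `column_names_to_indices` are hypothetical helper names, not part of Neuraxle, and the toy values are made up (only the column names come from the question). Wiring the resulting integers into Neuraxle's ColumnTransformer is deliberately left out; check its API page linked above for the exact call.

```python
import numpy as np
import pandas as pd


def pandas_to_numpy(df: pd.DataFrame) -> np.ndarray:
    """Hypothetical helper: what a PandasToNumpy step's transform would return."""
    return df.values


def column_names_to_indices(df: pd.DataFrame, names: list) -> list:
    """Hypothetical helper: map column names to their integer positions."""
    positions = df.columns.get_indexer(names)  # -1 marks a name that is absent
    missing = [name for name, pos in zip(names, positions) if pos == -1]
    if missing:
        raise KeyError(f"Columns not found: {missing}")
    return [int(pos) for pos in positions]


# Toy frame: column names borrowed from the question, values invented.
df = pd.DataFrame({
    "BEDCERT": [100, 80],
    "PROVNAME": ["A", "B"],
    "RESTOT": [90.0, 75.5],
    "SFF": [0, 1],
})

# Resolve the names once, while the DataFrame (and its labels) still exists.
indices = column_names_to_indices(df, ["BEDCERT", "RESTOT", "SFF"])

# After conversion only positions are left, so select by integer index.
selected = pandas_to_numpy(df)[:, indices]
```

The indices have to be computed before the conversion, because the NumPy array no longer carries column labels. If the column order of the incoming data can change between fit and transform, the lookup would need to be repeated on each frame.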