python scikit-learn numpy-ndarray scaling transformer-model

How to preserve column order after applying sklearn.compose.ColumnTransformer on numpy array


I want to use the Pipeline and ColumnTransformer classes from scikit-learn to scale a NumPy array. The scaler is applied to only some of the columns, and I want the output to keep the same column order as the input.

Example:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler


X = np.array([(25, 1, 2, 0),
              (30, 1, 5, 0),
              (25, 10, 2, 1),
              (25, 1, 2, 0),
              (np.nan, 10, 4, 1),
              (40, 1, 2, 1)])


column_trans = ColumnTransformer(
    [('scaler', MinMaxScaler(), [0, 2])],
    remainder='passthrough')

X_scaled = column_trans.fit_transform(X)

The problem is that ColumnTransformer changes the order of the columns: the scaled columns come first, followed by the passthrough columns. How can I preserve the original column order?

I am aware of this post, but it is for pandas DataFrames. For some reason, I cannot use a DataFrame and have to use a NumPy array in my code.

Thanks.


Solution

  • Here is a solution that adds a transformer which applies the inverse column permutation after the column transform:

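    import re

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin


    class ReorderColumnTransformer(BaseEstimator, TransformerMixin):
        # Matches the trailing column index in names such as 'scaler__x0'
        index_pattern = re.compile(r'\d+$')

        def __init__(self, column_transformer):
            self.column_transformer = column_transformer

        def fit(self, X, y=None):
            # Nothing to learn here; the wrapped column transformer is fitted earlier in the pipeline
            return self

        def transform(self, X, y=None):
            # Original index of each output column, read from the feature-name suffix
            order_after_column_transform = [
                int(self.index_pattern.search(col).group())
                for col in self.column_transformer.get_feature_names_out()
            ]
            # Invert the permutation: order_inverse[i] is the output position of original column i
            order_inverse = np.zeros(len(order_after_column_transform), dtype=int)
            order_inverse[order_after_column_transform] = np.arange(len(order_after_column_transform))
            return X[:, order_inverse]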
    

    It relies on parsing

    column_trans.get_feature_names_out()
    # = array(['scaler__x0', 'scaler__x2', 'remainder__x1', 'remainder__x3'],
    #      dtype=object)
    

    to read the original index of each column from the numeric suffix, then computes and applies the inverse permutation.

    To be used as:

    import numpy as np
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.pipeline import make_pipeline

    X = np.array([(25, 1, 2, 0),
                  (30, 1, 5, 0),
                  (25, 10, 2, 1),
                  (25, 1, 2, 0),
                  (np.nan, 10, 4, 1),
                  (40, 1, 2, 1)])

    column_trans = ColumnTransformer(
        [('scaler', MinMaxScaler(), [0, 2])],
        remainder='passthrough')

    pipeline = make_pipeline(column_trans, ReorderColumnTransformer(column_transformer=column_trans))
    X_scaled = pipeline.fit_transform(X)
    # X_scaled has the same column order as X
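
The inverse-permutation step can be checked in isolation. The following is a minimal sketch with a made-up permutation, chosen because it is not its own inverse; `order_after`, `X_t` and `restored` are illustrative names only. It also shows that `np.argsort` gives the same inverse as the scatter assignment used in `transform`:

```python
import numpy as np

# Hypothetical output order: output column j holds original column order_after[j]
order_after = np.array([2, 0, 3, 1])

# Scatter assignment, as in transform() above
order_inverse = np.zeros(len(order_after), dtype=int)
order_inverse[order_after] = np.arange(len(order_after))

# For a permutation, argsort returns its inverse
same_as_argsort = np.array_equal(order_inverse, np.argsort(order_after))

# Fake "transformed" matrix whose entries record the original column index
X_t = np.tile(order_after, (2, 1))
restored = X_t[:, order_inverse]

print(order_inverse)    # [1 3 0 2]
print(same_as_argsort)  # True
print(restored[0])      # [0 1 2 3]
```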
    

    Alternative solution that does not rely on string parsing. It reads the input column indices of each transformer from the fitted transformers_ attribute, which lists (name, transformer, columns) tuples in output order with the remainder last. It assumes the columns were specified as integer lists, remainder='passthrough', and transformers that keep the number of columns:

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin


    class ReorderColumnTransformer(BaseEstimator, TransformerMixin):

        def __init__(self, column_transformer):
            self.column_transformer = column_transformer

        def fit(self, X, y=None):
            return self

        def transform(self, X, y=None):
            # Original index of each output column, in output order
            order_after_column_transform = [
                col
                for _, _, columns in self.column_transformer.transformers_
                for col in columns
            ]
            n_cols = len(order_after_column_transform)

            # Invert the permutation and apply it
            order_inverse = np.zeros(n_cols, dtype=int)
            order_inverse[order_after_column_transform] = np.arange(n_cols)
            return X[:, order_inverse]
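
When the transformed columns are known up front as integer indices, the inverse permutation can also be computed without a custom transformer. This is a rough sketch rather than part of the answer above: `scaled_cols`, `remainder_cols` and `order_after` are illustrative names, and it assumes `remainder='passthrough'` and a transformer that returns one output column per input column.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

X = np.array([(25, 1, 2, 0),
              (30, 1, 5, 0),
              (25, 10, 2, 1),
              (25, 1, 2, 0),
              (np.nan, 10, 4, 1),
              (40, 1, 2, 1)])

scaled_cols = [0, 2]
column_trans = ColumnTransformer(
    [('scaler', MinMaxScaler(), scaled_cols)],
    remainder='passthrough')
X_t = column_trans.fit_transform(X)

# Output layout: scaled columns first, then the remaining columns in their original order
remainder_cols = [c for c in range(X.shape[1]) if c not in scaled_cols]
order_after = scaled_cols + remainder_cols

# argsort of a permutation is its inverse, so this restores the input column order
X_scaled = X_t[:, np.argsort(order_after)]
```

The passthrough columns of `X_scaled` can then be compared directly with the same columns of `X` as a sanity check.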