
Keeping track of the output columns in sklearn preprocessing


How do I keep track of the columns of the transformed array produced by sklearn.compose.ColumnTransformer? By "keeping track of" I mean that every piece of information required to perform an inverse transform must be shown explicitly. This includes at least the following:

  1. What is the source variable of each column in the output array?
  2. If a column of the output array comes from one-hot encoding of a categorical variable, what is that category?
  3. What is the exact imputed value for each variable?
  4. What is the (mean, stdev) used to standardize each numerical variable? (These may differ from direct calculation because of imputed missing values.)

I am using an approach based on this answer. My input dataset is also a generic pandas.DataFrame with multiple numerical and categorical columns. That answer does transform the raw dataset, but I lose track of the columns in the output array. I need this information for peer review, report writing, presentations and further model-building steps. I've been searching for a systematic approach, but with no luck.


Solution

  • The answer you mentioned is based on this example in Sklearn.

    You can answer your first two questions with the following snippet.

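    def get_feature_names(column_transformer):
        """Return the output column names of a fitted ColumnTransformer."""
        output_features = []

        for name, pipe, features in column_transformer.transformers_:
            if name == 'remainder':
                continue
            trans_features = list(features)
            # Assumes every transformer is a Pipeline, so its steps can be iterated
            for step in pipe:
                if hasattr(step, 'categories_'):
                    # Fitted encoder: one output column per (feature, category).
                    # get_feature_names() was removed in scikit-learn 1.2.
                    trans_features = list(step.get_feature_names_out(trans_features))
            output_features.extend(trans_features)

        return output_features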
    
    import pandas as pd
    pd.DataFrame(preprocessor.fit_transform(X_train),
                columns=get_feature_names(preprocessor))
    


    transformed_cols = get_feature_names(preprocessor)
    
    def get_original_column(col_index):
        # Assumes the source column name itself contains no underscore
        return transformed_cols[col_index].split('_')[0]
    
    get_original_column(3)
    # 'embarked'
    
    get_original_column(0)
    # 'age'
    
    def get_category(col_index):
        # Assumes neither column names nor categories contain an underscore
        new_col = transformed_cols[col_index].split('_')
        return 'no category' if len(new_col) < 2 else new_col[-1]
    
    print(get_category(3))
    # Q
    
    print(get_category(0))
    # no category
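
The two helpers above recover the source column and category by splitting on '_', which misreads any column or category name that itself contains an underscore. A minimal sketch of one alternative: record (source column, category) pairs directly from the encoder's fitted `categories_`. It assumes each transformer is a `Pipeline` whose only column-expanding step is a `OneHotEncoder` without the `drop` option. The toy data, the `home_dest` column, and the names `num`, `cat` and `get_column_map` are hypothetical, chosen only for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with underscores in a column name and in a category
X = pd.DataFrame({
    "age": [22.0, None, 35.0, 58.0],
    "home_dest": ["New_York", "Paris", "Paris", "New_York"],
})

preprocessor = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), ["age"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     ["home_dest"]),
])
preprocessor.fit(X)


def get_column_map(column_transformer):
    """Return one (source_column, category) pair per output column.

    category is None for columns that were not one-hot encoded.
    """
    column_map = []
    for name, pipe, features in column_transformer.transformers_:
        if name == "remainder":
            continue
        encoders = [step for step in pipe if isinstance(step, OneHotEncoder)]
        if encoders:
            # categories_ holds one array of categories per input feature
            for feature, cats in zip(features, encoders[0].categories_):
                column_map.extend((feature, cat) for cat in cats)
        else:
            column_map.extend((feature, None) for feature in features)
    return column_map


column_map = get_column_map(preprocessor)
print(column_map)
```

Indexing `column_map[i]` then gives the source variable and category of output column `i` without parsing any strings.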
    

    Tracking whether imputation or scaling has been applied to a feature is not trivial with the current version of Sklearn.
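
For the specific case where each transformer is a `Pipeline` built from `SimpleImputer`, `StandardScaler` and `OneHotEncoder`, the fitted values can at least be read back from the steps themselves: `SimpleImputer.statistics_` holds the fill value for each input column, and `StandardScaler.mean_` and `scale_` hold the mean and standard deviation computed after imputation. A minimal sketch under those assumptions; the toy data, the step names and the helper `get_fit_statistics` are hypothetical, and the column alignment only holds if no earlier step changes the number of columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; "age" has one missing value to be imputed
X = pd.DataFrame({
    "age": [20.0, None, 30.0, 70.0],
    "embarked": ["S", "Q", "S", "S"],
})

preprocessor = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), ["age"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     ["embarked"]),
])
preprocessor.fit(X)


def get_fit_statistics(column_transformer):
    """Collect the imputed value and scaling parameters per source column."""
    rows = []
    for name, pipe, features in column_transformer.transformers_:
        if name == "remainder":
            continue
        stats = {feature: {"source": feature} for feature in features}
        for step in pipe:
            if isinstance(step, SimpleImputer):
                # statistics_: fill value used for each input column
                for feature, value in zip(features, step.statistics_):
                    stats[feature]["imputed_value"] = value
            elif isinstance(step, StandardScaler):
                # mean_ / scale_: computed on the data after imputation
                for feature, mean, std in zip(features, step.mean_, step.scale_):
                    stats[feature]["mean"] = mean
                    stats[feature]["stdev"] = std
        rows.extend(stats.values())
    return pd.DataFrame(rows).set_index("source")


fit_stats = get_fit_statistics(preprocessor)
print(fit_stats)
```

Note that `StandardScaler` uses the population standard deviation (ddof=0), so `stdev` here differs from `DataFrame.std()`, which defaults to ddof=1, as well as from any statistic computed before imputation.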