python, scikit-learn, pipeline, imputation

What is the best way to implement Pipeline to make sure train and test dummy variables are the same?


I am building a custom transformer that implements a couple of steps to preprocess data. The first is that it applies a set of functions that I wrote that will take existing features and engineer new ones. From there, the categorical variables will be one-hot encoded. The last step will be to drop features or columns from the DataFrame that are no longer needed.

The dataset I'm using is the Kaggle House Prices dataset.

The problem is ensuring that the dummy variables in the test set match those in the training set. Some categories of a feature in the training set might not appear in the test set, so the test set won't have a dummy variable for those categories. I did some research, ran into this solution, and am trying to implement its first answer in my custom transformer class. First, I'm not sure whether this is the best way to do it. Second, I'm getting the error described below.

I've included the full list of the functions I apply to the data but only show a couple of the actual functions below.

class HouseFeatureTransformer(BaseEstimator, TransformerMixin):

    def __init__(self, funcs, func_cols, drop_cols, drop_first=True):

        self.funcs = funcs
        self.func_cols = func_cols

        self.train_cols = None
        self.drop_cols = drop_cols

        self.drop_first = drop_first

    def fit(self, X, y=None):

        X_trans = self.apply_funcs(X)
        X_trans.drop(columns=self.drop_cols, inplace=True)
        #save training_columns to compare to columns of any later seen dataset
        self.train_cols = X_trans.columns

        return self

    def transform(self, X, y=None):

        X_test = self.apply_funcs(X)
        X_test.drop(columns=self.drop_cols, inplace=True)
        test_cols = X_test.columns

        #ensure that all columns in the training set are present in the test set
        #set should be empty for first fit_transform
        missing_cols = set(self.train_cols) - set(test_cols)
        for col in missing_cols:
            X_test[col] = 0

        #reduce columns in test set to only what was in the training set
        X_test = X_test[self.train_cols]

        return X_test.values

    def apply_funcs(self, X):

        #apply each function to respective column
        for func, func_col in zip(self.funcs, self.func_cols):
            X[func_col] = X.apply(func, axis=1)

        #one hot encode categorical variables    
        X = pd.get_dummies(X, drop_first=self.drop_first)

        return X

#functions to apply
funcs = [sold_age, yrs_remod, lot_shape, land_slope, rfmat, bsmt_bath, baths, 
                other_rooms, fence_qual, newer_garage]
#feature names
func_cols = ['sold_age', 'yr_since_remod', 'LotShape', 'LandSlope', 'RoofMatl', 'BsmtBaths', 'Baths', \
                'OtherRmsAbvGr', 'Fence', 'newer_garage']

#features to drop
to_drop = ['Alley', 'Utilities', 'Condition2', 'HouseStyle', 'LowQualFinSF', 'EnclosedPorch', \
    '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'MiscFeature', 'MiscVal', \
    'YearBuilt', 'YrSold', 'YearRemodAdd', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', \
    'TotRmsAbvGrd', 'GarageYrBlt', '1stFlrSF', '2ndFlrSF', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'ExterQual', \
    'ExterCond', 'BsmtQual', 'BsmtCond', 'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond', 'BsmtFinType2', \
    'Exterior1st', 'Exterior2nd', 'GarageCars', 'Functional', 'SaleType', 'SaleCondition']

#functions to transform data
def sold_age(row):
    '''calculates the age of the house when it was sold'''
    return row['YrSold'] - row['YearBuilt']

def yrs_remod(row):
    '''calculates the years since house was remodeled'''
    yr_blt = row['YearBuilt']
    yr_remodeled = row['YearRemodAdd']
    yr_sold = row['YrSold']
    if yr_blt == yr_remodeled:
        return 0
    else:
        return yr_sold - yr_remodeled

def lot_shape(row):
    '''consolidates all irregular categories into one'''
    if row['LotShape'] == 'Reg':
        return 'Reg'
    else:
        return 'Irreg'

During the fit, I apply the functions, dummy the categoricals, drop the unneeded columns, then save the resulting columns to self.train_cols. During the transformation, I do the same steps, except that I save the transformed columns to test_cols. I then compare these columns to the ones obtained in the fit and add to the test set any columns that were in the training set but are missing from it, as shown in the answer I linked. The error I get is below:

KeyError: "['Alley' 'Utilities' 'Condition2' 'HouseStyle' 'PoolQC' 'MiscFeature'\n 'ExterQual' 'ExterCond' 'BsmtQual' 'BsmtCond' 'KitchenQual' 'FireplaceQu'\n 'GarageQual' 'GarageCond' 'BsmtFinType2' 'Exterior1st' 'Exterior2nd'\n 'Functional' 'SaleType' 'SaleCondition'] not found in axis"

I'm trying to understand why I'm getting this error and if there's a better way to implement this process than how I'm doing it.


Solution

  • Here are a few things I noted in your code that may help:

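    • The error says that some of the columns you are trying to drop don't exist in the DataFrame. Here the missing names are all categorical columns, which pd.get_dummies has already replaced with dummy columns by the time drop runs. To fix this, you can replace the code that drops columns with something like:
    import numpy as np
    import pandas as pd

    # toy data: only 'b' and 'c' exist, 'e' and 'f' do not
    data = np.random.rand(50, 4)
    df = pd.DataFrame(data, columns=["a", "b", "c", "d"])
    drop_columns = ['b', 'c', 'e', 'f']

    ## code to drop columns: keep only the names that are actually present
    columns = df.columns
    drop_columns = list(set(columns) & set(drop_columns))
    df.drop(columns=drop_columns, inplace=True)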
    
    • The fit function is only used to infer transformation parameters from the training data, and it is called only with training data. In your case, all you infer is which columns remain after applying the functions and dropping the specified columns. You don't need to actually apply the functions for that: you know which columns each function adds and which columns you need to drop, so you can work it out with a few set operations on the column names.

    • You can also simplify the transform function. You already know which columns to include, so first add the missing columns and then select only the columns you want to include, instead of dropping columns.
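
The points above can be put together in a minimal sketch. This is not the asker's class: AlignedDummiesTransformer, the toy column names, and the choice to encode without drop_first inside transform are assumptions made for illustration, and the feature-engineering functions are left out. It drops the unwanted raw columns before encoding, records the dummy column layout in fit, and uses DataFrame.reindex in transform to add missing columns as 0 and keep only the training columns.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class AlignedDummiesTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical, simplified stand-in for HouseFeatureTransformer."""

    def __init__(self, drop_cols=None, drop_first=True):
        self.drop_cols = drop_cols
        self.drop_first = drop_first

    def _drop(self, X):
        # Drop raw columns before encoding, while their original names still
        # exist. errors="ignore" skips names that are absent, and drop returns
        # a new DataFrame, so the caller's data is not modified.
        return X.drop(columns=list(self.drop_cols or []), errors="ignore")

    def fit(self, X, y=None):
        # Learn the final column layout from the training data only.
        encoded = pd.get_dummies(self._drop(X), drop_first=self.drop_first, dtype=int)
        self.train_cols_ = encoded.columns
        return self

    def transform(self, X):
        # Encode every category present, then align to the training layout:
        # reindex adds training columns missing here (filled with 0), discards
        # columns not seen in training, and restores the training order.
        encoded = pd.get_dummies(self._drop(X), dtype=int)
        return encoded.reindex(columns=self.train_cols_, fill_value=0)


# Toy data (assumed): the test set lacks category "c" and has an unseen "d".
train = pd.DataFrame({"area": [50, 80, 120],
                      "color": ["a", "b", "c"],
                      "Alley": ["x", "y", "x"]})
test = pd.DataFrame({"area": [60, 90],
                     "color": ["b", "d"],
                     "Alley": ["y", "y"]})

tf = AlignedDummiesTransformer(drop_cols=["Alley", "not_a_column"]).fit(train)
train_out = tf.transform(train)
test_out = tf.transform(test)
print(list(test_out.columns))
```

Encoding without drop_first in transform is deliberate in this sketch. If the test set lacks the category that was dropped as the baseline during fit, drop_first would remove a different category there, and reindex would then fill that column with 0 even for rows that belong to it. Letting reindex discard the baseline column avoids this.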