python pandas scikit-learn transform minmax

MinMaxScaler with ColumnTransformer (the transformed columns are shifted to the front)


I am trying to build a model on the House Prices - Advanced Regression Techniques data set, which has shape (1460, 80). It has 37 numerical features and 43 categorical features.

I want to scale the numerical features first, then one-hot encode the categorical features. I am using MinMaxScaler along with a ColumnTransformer.

After scaling the data, the DataFrame does not retain the column names.

Here is my code:

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler

columns_transform_sc=make_column_transformer((MinMaxScaler(),['MSSubClass',
 'LotFrontage',
 'LotArea',
 'OverallQual',
 'OverallCond',
 'YearBuilt',
 'YearRemodAdd',
 'MasVnrArea',
 'BsmtFinSF1',
 'BsmtFinSF2',
 'BsmtUnfSF',
 'TotalBsmtSF',
 '1stFlrSF',
 '2ndFlrSF',
 'LowQualFinSF',
 'GrLivArea',
 'BsmtFullBath',
 'BsmtHalfBath',
 'FullBath',
 'HalfBath',
 'BedroomAbvGr',
 'KitchenAbvGr',
 'TotRmsAbvGrd',
 'Fireplaces',
 'GarageYrBlt',
 'GarageCars',
 'GarageArea',
 'WoodDeckSF',
 'OpenPorchSF',
 'EnclosedPorch',
 '3SsnPorch',
 'ScreenPorch',
 'PoolArea',
 'MiscVal',
 'MoSold',
 'YrSold']),remainder="passthrough")

sc_df=columns_transform_sc.fit_transform(x_train)

I used the original DataFrame's (x_train) columns for the scaled DataFrame (sc_df):

sc_df=pd.DataFrame(sc_df,index=x_train.index,columns=x_train.columns)

The problem I'm facing is that the column transformer moves all the columns it has transformed to the front and the passthrough columns to the back, so I can't use x_train.columns as the columns of sc_df.


All the categorical features have been shifted to the back. Is there a way to retain the column names, or to get the column names in the new order?

Also, should I encode the categorical features (one-hot or label encoding) first and then scale (standardize or normalize) everything, including the encoded data, or should I scale first and then encode?


Solution

  • I think you can, and sometimes have to, do the scaling first. I suggest trying this:

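    from sklearn.preprocessing import QuantileTransformer

    # Transform a single column in place; reshape(-1, 1) makes the 1-D values 2-D.
    qt = QuantileTransformer(n_quantiles=50, output_distribution='normal', random_state=0)
    df.Betrag = qt.fit_transform(df.Betrag.values.reshape(-1, 1))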
    

    Note: You can replace the single column with a slice of columns, using the standard syntax for selecting a subset of pandas DataFrame columns:

    age_sex = titanic[["Age", "Sex"]]
    

    In this case, you would pass age_sex to the fit and transform functions, assuming these are the columns you want to transform (for a scaler, they need to be numeric). You are also not restricted to the QuantileTransformer; the code should work generically for all transformers.

    EDIT: Quick side note: the reshape operation is only necessary if you pass a tensor with a single feature to the QuantileTransformer. With a multi-feature tensor, whichever transformer you use, it should not be necessary.
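
The approach in this answer, applied to the MinMaxScaler from the question, might look like the sketch below. It assumes a made-up `toy` frame and a hand-written `num_cols` list, neither of which is from the original post. Assigning the scaled values back to the same column subset keeps every column name and position as it was, so no renaming is needed afterwards.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy frame with a categorical column between two numeric ones.
toy = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, 11250.0],
    "Street": ["Pave", "Grvl", "Pave"],
    "YrSold": [2008.0, 2007.0, 2008.0],
})
num_cols = ["LotArea", "YrSold"]  # columns to scale (illustrative)

scaled = toy.copy()
# A multi-column selection is already 2-D, so no reshape is needed here.
scaled[num_cols] = MinMaxScaler().fit_transform(scaled[num_cols])
```

In a train/test setting, the scaler would usually be fitted on the training frame only and then reused via `transform` on the test frame.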