python, pandas, scikit-learn, feature-selection, feature-engineering

Feature Engineering: Scaling for different distributions


I am trying to understand the best way to scale my features and to learn how to use the scikit-learn package to fit/transform on the dataset I will be predicting on.

I have 2 groups of data.

The first group is normally distributed, so I am just looking to scale the values (positive values between 20 and 100) using MinMaxScaler.

The second group of features has outliers, so I believe RobustScaler will give better results.

My questions are:

  1. Can I use multiple scalers on my dataset for a classification problem using a random forest (RF)?
  2. Within scikit-learn, when I try to scale just one feature using RobustScaler on my training data, I get this error: ValueError: Expected 2D array, got 1D array instead. I am not sure how to read this error; can I not scale just one feature?
  3. If I use two scalers for my data, what is the best way to implement the feature engineering if I am looking to make predictions one row at a time? Do I just use transform?

Solution

    1. Yes, you can, if you find it useful (a full pipeline sketch combining both scalers for single-row predictions is at the end of this answer).
    2. You can scale a single feature. If you do something like this, you will get an error:
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    
    df = pd.DataFrame({
        "feature1": [1,2,3,4,5],
        "feature2": [100, 200, 300, 400, 500],
        "feature3": [200, 300, 400, 500, 600],
    })
    
    scaler = StandardScaler()
    
    # df["feature1"] is a 1D pandas Series, which StandardScaler cannot accept
    scaler.fit_transform(df["feature1"])
    
    # output
    ValueError: Expected 2D array, got 1D array instead:
    

    You need to additionally reshape the input if it is a single column:

    scaler = StandardScaler()
    
    scaler.fit_transform(df["feature1"].values.reshape(-1, 1))
    
    # output
    array([[-1.41421356],
           [-0.70710678],
           [ 0.        ],
           [ 0.70710678],
           [ 1.41421356]])
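    
    Alternatively, selecting the column with a list of names keeps it as a one-column DataFrame, which is already 2D, so no reshape is needed:

    scaler = StandardScaler()
    
    # df[["feature1"]] is a one-column DataFrame (2D), unlike df["feature1"], which is a 1D Series
    scaler.fit_transform(df[["feature1"]])
    
    # output
    array([[-1.41421356],
           [-0.70710678],
           [ 0.        ],
           [ 0.70710678],
           [ 1.41421356]])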
    
    3. You can branch the preprocessing by column group using ColumnTransformer:
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import StandardScaler, MinMaxScaler
    
    
    df = pd.DataFrame({
        "feature1": [1,2,3,4,5],
        "feature2": [100, 200, 300, 400, 500],
        "feature3": [200, 300, 400, 500, 600],
    })
    
    transformers = ColumnTransformer(
        transformers=[
            ("scaling1", MinMaxScaler(), ["feature1"]),
            ("scaling2", StandardScaler(), ["feature2", "feature3"])
        ]
    )
    
    transformed_df = transformers.fit_transform(df)
    
    transformed_df
    
    # output
    array([[ 0.        , -1.41421356, -1.41421356],
           [ 0.25      , -0.70710678, -0.70710678],
           [ 0.5       ,  0.        ,  0.        ],
           [ 0.75      ,  0.70710678,  0.70710678],
           [ 1.        ,  1.41421356,  1.41421356]])
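    
    If your scikit-learn version has the set_output API (added in 1.2; this is an assumption about your environment), you can also ask the ColumnTransformer to return a DataFrame instead of a NumPy array, which makes it easier to see which scaler produced which column:

    transformers_pandas = ColumnTransformer(
        transformers=[
            ("scaling1", MinMaxScaler(), ["feature1"]),
            ("scaling2", StandardScaler(), ["feature2", "feature3"])
        ]
    ).set_output(transform="pandas")  # requires scikit-learn >= 1.2
    
    transformers_pandas.fit_transform(df)
    # returns a DataFrame with prefixed column names such as
    # "scaling1__feature1", "scaling2__feature2", "scaling2__feature3"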
    
    

    If you would like, for example, to use the first scaler (scaling1) to inverse-transform the first column:

    scaler_1 = transformers.named_transformers_["scaling1"]
    scaler_1.inverse_transform(transformed_df[:, 0].reshape(-1, 1))
    
    # output
    array([[1.],
           [2.],
           [3.],
           [4.],
           [5.]])
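    
    As for question 3: fit the whole preprocessing-plus-model pipeline once on your training data; at prediction time you only call predict on a one-row DataFrame, and the already-fitted scalers apply transform internally, so you never refit on new rows. Below is a minimal sketch, assuming the normally distributed column is feature1, the columns with outliers are feature2 and feature3, and that the y_train labels are made up purely for illustration:

    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MinMaxScaler, RobustScaler
    
    X_train = df[["feature1", "feature2", "feature3"]]
    y_train = [0, 1, 0, 1, 0]  # hypothetical labels, for illustration only
    
    preprocessing = ColumnTransformer(
        transformers=[
            ("minmax", MinMaxScaler(), ["feature1"]),             # normally distributed group
            ("robust", RobustScaler(), ["feature2", "feature3"])  # group with outliers
        ]
    )
    
    model = Pipeline(
        steps=[
            ("preprocessing", preprocessing),
            ("classifier", RandomForestClassifier(random_state=0))
        ]
    )
    
    model.fit(X_train, y_train)  # scalers are fitted on the training data only
    
    # predict a single new row: pass a one-row DataFrame with the same columns;
    # the fitted pipeline applies transform (not fit_transform) before predicting
    new_row = pd.DataFrame({"feature1": [3], "feature2": [250], "feature3": [450]})
    model.predict(new_row)

    Saving this fitted pipeline (for example with joblib) keeps the scalers and the model together, so a single object handles both the preprocessing and the prediction for each incoming row.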