python pandas scikit-learn scale data-science

Apply StandardScaler to parts of a data set


I want to use sklearn's StandardScaler. Is it possible to apply it to some feature columns but not others?

For instance, say my data is:

import pandas as pd
data = pd.DataFrame({'Name': [3, 4, 6], 'Age': [18, 92, 98], 'Weight': [68, 59, 49]})

   Age  Name  Weight
0   18     3      68
1   92     4      59
2   98     6      49


col_names = ['Name', 'Age', 'Weight']
features = data[col_names]

I fit and transform the data:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)
scaled_features = pd.DataFrame(features, columns=col_names)

       Name       Age    Weight
0 -1.069045 -1.411004  1.202703
1 -0.267261  0.623041  0.042954
2  1.336306  0.787964 -1.245657

Of course, the names are really strings, not integers, and I don't want to standardize them. How can I apply the fit and transform methods to only the Age and Weight columns?


Solution

  • scikit-learn v0.20 introduced ColumnTransformer, which applies transformers to specified sets of columns of an array or pandas DataFrame.

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import StandardScaler

    data = pd.DataFrame({'Name': [3, 4, 6], 'Age': [18, 92, 98], 'Weight': [68, 59, 49]})

    col_names = ['Name', 'Age', 'Weight']
    features = data[col_names]

    # Scale only Age and Weight; remainder='passthrough' keeps Name unchanged
    ct = ColumnTransformer([
            ('somename', StandardScaler(), ['Age', 'Weight'])
        ], remainder='passthrough')

    ct.fit_transform(features)


    NB: Like Pipeline, ColumnTransformer also has a shorthand, make_column_transformer, which doesn't require naming the transformers.
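
As a minimal sketch of that shorthand, assuming a recent scikit-learn release in which each tuple is written as (transformer, columns), the same preprocessing might look like this. The variable names ct_short and result are only illustrative.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'Name': [3, 4, 6], 'Age': [18, 92, 98], 'Weight': [68, 59, 49]})

# Each tuple is (transformer, columns); the step name is generated
# automatically from the transformer's class name.
ct_short = make_column_transformer(
    (StandardScaler(), ['Age', 'Weight']),
    remainder='passthrough',
)

# Scaled Age and Weight come first, the untouched Name column last.
result = ct_short.fit_transform(data)
print(result)
```

This should return the same array as the ColumnTransformer call above.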

    Output

    [[-1.41100443,  1.20270298,  3.        ],
     [ 0.62304092,  0.04295368,  4.        ],
     [ 0.78796352, -1.24565666,  6.        ]]
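
Note that the result is a plain NumPy array and that its columns come back as Age, Weight, Name: the scaled columns first, then the passthrough remainder. A minimal sketch of putting the labels back, assuming the passthrough columns keep their original relative order (the names scaled_cols, other_cols and scaled are only illustrative):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'Name': [3, 4, 6], 'Age': [18, 92, 98], 'Weight': [68, 59, 49]})

scaled_cols = ['Age', 'Weight']
# Columns not listed for scaling are passed through and appended at the end.
other_cols = [c for c in data.columns if c not in scaled_cols]

ct = ColumnTransformer(
    [('somename', StandardScaler(), scaled_cols)],
    remainder='passthrough',
)

# Label the array using the output order: scaled columns, then the remainder.
scaled = pd.DataFrame(
    ct.fit_transform(data),
    columns=scaled_cols + other_cols,
    index=data.index,
)

# Restore the original column order of the input frame.
scaled = scaled[data.columns]
print(scaled)
```

The Name column comes back as floats here because the whole array shares one dtype. On newer scikit-learn releases, ct.get_feature_names_out() can supply the output labels instead of building the list by hand.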