scikit-learn, pipeline, categorical-data, sklearn-pandas, one-hot-encoding

Understanding how OneHotEncoder works: why do I get multiple ones in an OHE column?


I am using sklearn pipelines to perform one-hot encoding:

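from sklearn.compose import make_column_transformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from xgboost import XGBClassifier

# Scale the numeric columns and one-hot encode the 'country' column
preprocess = make_column_transformer(
    (MinMaxScaler(), numeric_cols),
    (OneHotEncoder(), ['country'])
)

param_grid = {
    'xgbclassifier__learning_rate': [0.01, 0.005, 0.001],
}

model = make_pipeline(preprocess, XGBClassifier())

# Initialize the grid search model
# (the `iid` argument was removed in scikit-learn 0.24, so it is omitted here)
model = GridSearchCV(model, param_grid=param_grid, scoring='roc_auc',
                     verbose=1, refit=True, cv=3)
model.fit(X_train, y_train)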

To see how the countries are one-hot encoded, I run the following (I know there are two of them):

pd.DataFrame(preprocess.fit_transform(X_test))

The result of this is:

[screenshot of the transformed DataFrame; image not available]

A few questions:

  • Correct me if I'm wrong, but I thought a one-hot encoding was a series of all 0s with just ONE 1. Why do I get several 1s in one column?
  • When I call model.predict(X_test), does it apply the transformations defined in the pipeline, as fitted during training?
  • How do I retrieve the feature names when I call fit_transform?

Solution

  • Regarding (1), here is how one-hot encoding (OHE) works.

    Suppose you have one column of categorical data:

    import pandas as pd

    df = pd.DataFrame({"categorical": ["a", "b", "a"]})
    print(df)
      categorical
    0           a
    1           b
    2           a
    

    Then you'll get exactly one 1 per row (this always holds when a single categorical column is encoded), but not necessarily one 1 per output column: an output column holds a 1 for every row that belongs to its category.

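    from sklearn.preprocessing import OneHotEncoder

    ohe = OneHotEncoder()
    ohe.fit(df)
    # transform() returns a sparse matrix; convert it to a dense array for display
    ohe_out = ohe.transform(df).toarray()
    # get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
    # ohe_df = pd.DataFrame(ohe_out, columns=ohe.get_feature_names_out(df.columns))
    ohe_df = pd.DataFrame(ohe_out, columns=ohe.get_feature_names_out(["categorical"]))
    print(ohe_df)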
       categorical_a  categorical_b
    0            1.0            0.0
    1            0.0            1.0
    2            1.0            0.0
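
As a rough illustration of the point above, here is a hand-rolled sketch of one-hot encoding in plain Python. It is not how scikit-learn implements `OneHotEncoder`; the variable names (`values`, `categories`, `encoded`) are made up for this example. It only shows why every row sums to 1 while a column sum equals the number of rows in that category.

```python
# Hand-rolled one-hot encoding, for illustration only
# (an assumption-level sketch, not scikit-learn's implementation).
values = ["a", "b", "a"]           # same toy column as in the example above
categories = sorted(set(values))   # one output column per distinct category

# One row per sample: 1 in the column matching the sample's category, 0 elsewhere.
encoded = [[1 if v == c else 0 for c in categories] for v in values]

row_sums = [sum(row) for row in encoded]         # exactly one 1 per row
col_sums = [sum(col) for col in zip(*encoded)]   # how often each category occurs

print(encoded)   # [[1, 0], [0, 1], [1, 0]]
print(row_sums)  # [1, 1, 1]
print(col_sums)  # [2, 1]
```

Category "a" appears twice, so its column contains two 1s, which is the "several ones in one column" effect from the question.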
    

    If you encode more columns, e.g. a numerical column, each row still gets exactly one 1 within the group of output columns derived from each original column, but no longer a single 1 across the whole row:

    df = pd.DataFrame({"categorical": ["a", "b", "a"], "nums": [0, 1, 0]})
    print(df)
      categorical  nums
    0           a     0
    1           b     1
    2           a     0
    

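    ohe.fit(df)
    # transform() returns a sparse matrix; convert it to a dense array for display
    ohe_out = ohe.transform(df).toarray()
    # get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
    # ohe_df = pd.DataFrame(ohe_out, columns=ohe.get_feature_names_out(df.columns))
    ohe_df = pd.DataFrame(ohe_out, columns=ohe.get_feature_names_out(["categorical", "nums"]))
    print(ohe_df)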
    print(ohe_df)
       categorical_a  categorical_b  nums_0  nums_1
    0            1.0            0.0     1.0     0.0
    1            0.0            1.0     0.0     1.0
    2            1.0            0.0     1.0     0.0
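
A quick way to check the two-column case is to sum the encoded matrix along each axis. The sketch below assumes the same toy DataFrame as above and a scikit-learn version where `OneHotEncoder` returns a sparse matrix by default; `row_sums` and `col_sums` are names chosen for this example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Same toy data as in the example above
df = pd.DataFrame({"categorical": ["a", "b", "a"], "nums": [0, 1, 0]})

# Encode both columns; the default output is sparse, so convert it to a dense array
ohe = OneHotEncoder()
encoded = ohe.fit_transform(df).toarray()

# Each original column contributes exactly one 1 per row,
# so every row sums to the number of encoded input columns (2 here).
row_sums = encoded.sum(axis=1)

# Each output column sums to the number of rows in that category.
col_sums = encoded.sum(axis=0)

print(row_sums)  # [2. 2. 2.]
print(col_sums)  # [2. 1. 2. 1.]
```

If only the categorical column should be encoded, restricting the encoder to that column (as the `make_column_transformer` call in the question does for 'country') brings back one 1 per row within the encoded block.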