
Prediction After One-hot encoding

I am working with a sample DataFrame:

import pandas as pd

data = [['Alex','USA',0],['Bob','India',1],['Clarke','SriLanka',0]]

df = pd.DataFrame(data,columns=['Name','Country','Target'])

From here, I used get_dummies to convert the string columns to integer indicator columns:


one_hot = pd.get_dummies(df[column_names])  

After conversion, the columns are: Name_Alex, Name_Bob, Name_Clarke, Country_India, Country_SriLanka, Country_USA
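To make the encoding step concrete, here is a minimal sketch on the sample data. It assumes that `column_names` is `['Name', 'Country']` (the snippet above does not define it) and that the label column is named `Target`. It also shows why encoding a new row on its own is a problem: the new row produces only the columns for the categories it contains.

```python
import pandas as pd

data = [['Alex', 'USA', 0], ['Bob', 'India', 1], ['Clarke', 'SriLanka', 0]]
df = pd.DataFrame(data, columns=['Name', 'Country', 'Target'])

# Assumption: these are the string columns to encode.
column_names = ['Name', 'Country']
one_hot = pd.get_dummies(df[column_names])
train_columns = list(one_hot.columns)

# Encoding a new row by itself yields only the categories present in that row,
# so its columns do not line up with the training columns.
new_row = pd.DataFrame([['Alex', 'USA']], columns=column_names)
new_one_hot = pd.get_dummies(new_row)
new_columns = list(new_one_hot.columns)

print(train_columns)
print(new_columns)
```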

Slicing the data.



Splitting the dataset into train and test sets

from sklearn.model_selection import train_test_split


Logistic Regression

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

Now the model is trained.
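For reference, the training steps described so far could look roughly like the sketch below. The slicing and splitting code is not shown in the question, so the names `X` and `y` are assumptions, and the model is fitted on all three rows because the sample is too small for a meaningful train/test split.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

data = [['Alex', 'USA', 0], ['Bob', 'India', 1], ['Clarke', 'SriLanka', 0]]
df = pd.DataFrame(data, columns=['Name', 'Country', 'Target'])

# Assumed slicing: one-hot encoded features in X, label in y.
X = pd.get_dummies(df[['Name', 'Country']], dtype=int)
y = df['Target']

logreg = LogisticRegression()
logreg.fit(X, y)

# The fitted model expects the same six one-hot columns at prediction time.
n_features = logreg.coef_.shape[1]
print(n_features)
```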

For prediction, let's say I want to predict the "target" by giving a "Name" and a "Country", for example: ["Alex","USA"].


If I pass these raw strings to the model directly, it obviously will not work, because the model was trained on the one-hot encoded columns.

Question 1) How do I test the prediction after applying one-hot encoding during training?

Question 2) How do I run predictions on a sample CSV file that contains only "Name" and "Country"?


  • I suggest using scikit-learn's LabelEncoder and OneHotEncoder instead of pd.get_dummies.

    Once you have fitted a label encoder and a one-hot encoder for each feature, save them somewhere, so that when you want to run predictions you can load the saved encoders and encode the new features with them.

    This way you encode the features exactly as you did when building the training set.

    Below is the code I use to fit and store the encoders:

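    import numpy as np
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder

    # X is a 2-D NumPy array of string features, one column per feature.
    labelencoder_dict = {}
    onehotencoder_dict = {}
    X_train = None
    for i in range(0, X.shape[1]):
        # Map each category in column i to an integer.
        label_encoder = LabelEncoder()
        labelencoder_dict[i] = label_encoder
        feature = label_encoder.fit_transform(X[:,i])
        feature = feature.reshape(X.shape[0], 1)
        # On scikit-learn 1.2+ this argument is named sparse_output.
        onehot_encoder = OneHotEncoder(sparse=False)
        feature = onehot_encoder.fit_transform(feature)
        onehotencoder_dict[i] = onehot_encoder
        # Stack the encoded columns side by side.
        if X_train is None:
            X_train = feature
        else:
            X_train = np.concatenate((X_train, feature), axis=1)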

    Now I save onehotencoder_dict and labelencoder_dict and use them later for encoding:

    def getEncoded(test_data,labelencoder_dict,onehotencoder_dict):
        # Re-apply the encoders fitted on the training data, column by column.
        test_encoded_x = None
        for i in range(0,test_data.shape[1]):
            label_encoder =  labelencoder_dict[i]
            feature = label_encoder.transform(test_data[:,i])
            feature = feature.reshape(test_data.shape[0], 1)
            onehot_encoder = onehotencoder_dict[i]
            feature = onehot_encoder.transform(feature)
            if test_encoded_x is None:
                test_encoded_x = feature
            else:
                test_encoded_x = np.concatenate((test_encoded_x, feature), axis=1)
        return test_encoded_x
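
As a usage sketch covering both questions: recent scikit-learn versions let `OneHotEncoder` consume string columns directly, so the example below swaps the per-column `LabelEncoder` step for a single encoder. This is a substitution, not the code from the answer above. The CSV content, the `pickle` round trip, and all variable names are illustrative assumptions.

```python
import io
import pickle

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

data = [['Alex', 'USA', 0], ['Bob', 'India', 1], ['Clarke', 'SriLanka', 0]]
df = pd.DataFrame(data, columns=['Name', 'Country', 'Target'])
feature_columns = ['Name', 'Country']

# Fit the encoder once, on the training data only.
# handle_unknown='ignore' encodes unseen categories as all zeros instead of raising.
encoder = OneHotEncoder(handle_unknown='ignore')
X_train = encoder.fit_transform(df[feature_columns])

logreg = LogisticRegression()
logreg.fit(X_train, df['Target'])

# Persist encoder and model together (to bytes here; a file works the same way).
saved = pickle.dumps({'encoder': encoder, 'model': logreg})

# Later, at prediction time: restore both and reuse the fitted encoder.
restored = pickle.loads(saved)

# Stand-in for a CSV file that contains only "Name" and "Country".
csv_file = io.StringIO("Name,Country\nAlex,USA\nBob,India\n")
new_df = pd.read_csv(csv_file)

# transform (not fit_transform) keeps the column layout learned during training.
X_new = restored['encoder'].transform(new_df[feature_columns])
predictions = restored['model'].predict(X_new)
print(predictions)
```

The key point is the same as in the answer: the encoder is fitted once on the training data and only `transform` is called on new data, so the new rows always get the same six columns the model was trained on.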