python machine-learning one-hot-encoding

One-hot encoding: trouble fitting after applying the encoding to train and test dataframes


I have two dataframes, training and testing, that start out with the same number of columns. Because the categorical columns contain different values in the two dataframes, one-hot encoding them produces encoded dataframes with different numbers of columns, so it becomes impossible to make a prediction. How can I get two encoded dataframes whose columns are aligned (by name and order)?

I found 5 cases similar to mine and applied what they suggested, but without success. I first tried pd.get_dummies and then ColumnTransformer. This is the encoding part of my code:

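from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Column names (lista*) and positions (indice*) for each group
lista0, indice0, lista1, indice1 = [], [], [], []  # train: categorical, numerical
lista2, indice2, lista3, indice3 = [], [], [], []  # test: categorical, numerical

# A column counts as categorical if its first value is a string
for k in range(80):
    if isinstance(X_train.iloc[0,k],str):
        lista0.append(X_train.columns[k])
        indice0.append(k)
    else:
        lista1.append(X_train.columns[k])
        indice1.append(k)
for z in range(80):
    if isinstance(X_test.iloc[0,z],str):
        lista2.append(X_test.columns[z])
        indice2.append(z)
    else:
        lista3.append(X_test.columns[z])
        indice3.append(z)
# A separate transformer is built and fitted for each dataframe
X_train_tran = ColumnTransformer([('onehot',OneHotEncoder(sparse_output=False),indice0),('nothing','passthrough',indice1)])
X_test_tran = ColumnTransformer([('onehot',OneHotEncoder(sparse_output=False),indice2),('nothing','passthrough',indice3)])
X1_train = X_train_tran.fit_transform(X_train)
X1_test = X_test_tran.fit_transform(X_test)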

I built two lists for the training data: lista0 for the categorical columns and lista1 for the numerical columns (lista2 and lista3 are the equivalents for the test data).


Solution

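  • To keep the columns aligned, do not fit separate encoders on the training and testing data. Fit the encoder on the training data only, then use that same fitted encoder to transform both the training and the testing data. The output columns are then fixed by the categories seen during fitting, so their number and order are identical for the two sets. If the test data contains categories that never appear in the training data, the encoder also needs handle_unknown='ignore'; otherwise transform raises an error on those values.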

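    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder

    # Categorical and numerical column names, taken from the training data only
    categorical_cols = lista0
    numerical_cols = lista1

    # One ColumnTransformer handles both groups: one-hot encode the categorical
    # columns and pass the numerical columns through unchanged.
    # handle_unknown='ignore' encodes categories that appear only in the test
    # data as all zeros instead of raising an error.
    preprocessor = ColumnTransformer(transformers=[
        ('cat', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), categorical_cols),
        ('num', 'passthrough', numerical_cols)]
    )
    # Fit on the training data only, then reuse the fitted preprocessor for the test data
    X_train_transformed = preprocessor.fit_transform(X_train)
    X_test_transformed = preprocessor.transform(X_test)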