I have two dataframes, training and testing, that start with the same number of columns.
Because their categorical columns contain different values, one-hot encoding them produces encoded dataframes with different numbers of columns, which makes it impossible to make a prediction.
How can I get the two encoded dataframes with their columns aligned (by name and order)?
I found 5 cases similar to mine and applied what I found, but without success.
I first tried pd.get_dummies and then ColumnTransformer.
This is the encoding part of my code:
for k in range(80):
    if isinstance(X_train.iloc[0,k],str):
        lista0.append(X_train.columns[k])
        indice0.append(k)
    else:
        lista1.append(X_train.columns[k])
        indice1.append(k)
for z in range(80):
    if isinstance(X_test.iloc[0,z],str):
        lista2.append(X_test.columns[z])
        indice2.append(z)
    else:
        lista3.append(X_test.columns[z])
        indice3.append(z)
X_train_tran = ColumnTransformer([('onehot',OneHotEncoder(sparse_output=False),indice0),('nothing','passthrough',indice1)])
X_test_tran = ColumnTransformer([('onehot',OneHotEncoder(sparse_output=False),indice2),('nothing','passthrough',indice3)])
X1_train = X_train_tran.fit_transform(X_train)
X1_test = X_test_tran.fit_transform(X_test)
I made two lists: lista0 for the categorical columns and lista1 for the numerical columns.
To keep the columns aligned, do not fit a separate encoder on each dataset. Fit the encoder on the training data only, then use that same fitted encoder to transform both the training and the testing data. The encoded columns and their order will then be identical in both. If the test data contains category values that never appear in the training data, also pass handle_unknown='ignore' to OneHotEncoder: by default it raises an error on unseen categories, whereas with 'ignore' they are encoded as all zeros.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Identify categorical and numerical columns (taken from the training data)
categorical_cols = lista0
numerical_cols = lista1
# Create a single ColumnTransformer that handles both categorical and numerical columns
preprocessor = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), categorical_cols),
    ('num', 'passthrough', numerical_cols)
])
# Fit the preprocessor on the training data only, then transform both datasets
X_train_transformed = preprocessor.fit_transform(X_train)
X_test_transformed = preprocessor.transform(X_test)