I'm using sklearn pipelines to preprocess my data.
```python
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('imputer', KNNImputer(n_neighbors=2, weights='uniform',
                           metric='nan_euclidean', add_indicator=True)),
])
categorical_transformer = Pipeline(steps=[
    ('one_hot_encoder', OneHotEncoder(sparse=False, handle_unknown='ignore')),
])
```
```python
from sklearn.compose import ColumnTransformer

numeric_features = ['Latitud', 'Longitud', 'Habitaciones', 'Dormitorios',
                    'Baños', 'Superficie_Total', 'Superficie_cubierta']
categorical_features = ['Tipo_de_propiedad']

# Each transformers entry must be a (name, transformer, columns) 3-tuple;
# the explicit feature lists are used here, so no column selector is needed.
preprocessor = ColumnTransformer(transformers=[
    ('numeric', numeric_transformer, numeric_features),
    ('categorical', categorical_transformer, categorical_features),
])
```
The feature `Tipo_de_propiedad` has 3 classes: 'Departamento', 'Casa', 'PH'. So the 7 numeric features plus these dummies should give me 10 columns after transforming, but when I apply `fit_transform`, it returns 14 features.

```python
train_transfor = pd.DataFrame(preprocessor.fit_transform(X_train))
train_transfor.head()
```
When I use `pd.get_dummies` it works well, but I can't use it inside the Pipeline; `OneHotEncoder` is useful because I can fit it on the train set and transform the test set.

```python
dummy = pd.get_dummies(df30[["Tipo_de_propiedad"]])
df_new = pd.concat([df30, dummy], axis=1)
df_new.head()
```
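That fit-on-train / transform-on-test argument can be sketched as follows (toy data, not the original `df30`; the 'Loft' category is invented to simulate an unseen value):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'Tipo_de_propiedad': ['Casa', 'Departamento', 'PH']})
test = pd.DataFrame({'Tipo_de_propiedad': ['Casa', 'Loft']})  # 'Loft' unseen

# Fit on train only: the learned categories define the output columns.
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)

# The unseen 'Loft' row becomes all zeros instead of raising an error,
# and the column layout always matches the training set.
out = enc.transform(test).toarray()
print(out)
```

With `pd.get_dummies`, by contrast, the test set would get its own column layout derived from whatever categories happen to appear in it, which is exactly what breaks train/test consistency.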
Your `KNNImputer` was given `add_indicator=True`, so the extra columns are presumably missingness indicators: one binary column for each numeric feature that contained NaNs when the imputer was fitted. Four of your seven numeric columns apparently contain missing values, which accounts for 7 + 4 + 3 = 14 output features.
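A minimal sketch of the effect, assuming a toy array where only the first column has a missing value:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Two numeric columns; only the first contains a NaN.
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

# With add_indicator=True, one extra 0/1 column is appended per input
# column that contained missing values during fit.
imputer = KNNImputer(n_neighbors=2, add_indicator=True)
out = imputer.fit_transform(X)
print(out.shape)  # (4, 3): 2 imputed columns + 1 missingness indicator
```

Setting `add_indicator=False` (the default) would give you the 10 columns you expected; keep it `True` only if you want the model to see *which* values were imputed.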