python, pandas, machine-learning, scikit-learn, one-hot-encoding

Found input variables with inconsistent numbers of samples: OHE


I am encoding categorical features. I know that I need to use OneHotEncoder() because the feature names differ in the test data, so I cannot use pd.get_dummies. In train I have x rows and in test 1 row; after OHE the test row is shorter than the train rows, and I have no idea how to compare it with train.

le = LabelEncoder()
dfle = df.apply(le.fit_transform)
X = dfle.values
ohe = OneHotEncoder(handle_unknown='ignore')
X = ohe.fit_transform(X).toarray()


le = LabelEncoder()
testle = test.apply(le.fit_transform)
y = testle.values
two = OneHotEncoder(handle_unknown='ignore')
y = two.fit_transform(y).toarray()


rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

rf.predict([[ ? ]])

Output of X and y:

X:
[[0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 1. 0.
  0. 1.]
 [0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0.
  1. 1.]
 [0. 1. 0. 0. 0. 1. 0. 1. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1.
  0. 1.]
 [1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1.
  0. 1.]]

y:
[[1. 1. 1. 1. 1. 1. 1. 1. 1.]]

Solution

  • First, I think you misunderstand what X and y mean. X represents your features and y your target(s). That is a separate distinction from the train/test split (X_train, X_test, y_train, y_test). If y holds your test data, you should rename it to make that clear.

    Here, it seems y is your test data:

    In train I have x rows and in test 1 row

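    You should reuse the transformers you already fitted on X to transform (and only transform, not fit!) your test data. A new encoder fitted on the test data learns only the categories present there, so its output has fewer columns than the training matrix.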

    What you should not do:

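    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder
    
    df1 = pd.DataFrame({'country': ['USA', 'France'], 'language': ['EN', 'FR']})
    # scikit-learn >= 1.2 uses sparse_output; older versions use sparse=False
    ohe = OneHotEncoder(sparse_output=False)
    X_train = ohe.fit_transform(df1)
    
    df2 = pd.DataFrame({'country': ['USA'], 'language': ['EN']})
    # Wrong: a second encoder fitted on the test data only sees its categories
    ohe = OneHotEncoder(sparse_output=False)
    X_test = ohe.fit_transform(df2)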
    
    # X_train
    # array([[0., 1., 1., 0.],
    #        [1., 0., 0., 1.]])
    
    # X_test
    # array([[1., 1.]])  # shape differs from X_train
    

    What you should do:

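    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder
    
    df1 = pd.DataFrame({'country': ['USA', 'France'], 'language': ['EN', 'FR']})
    # scikit-learn >= 1.2 uses sparse_output; older versions use sparse=False
    ohe = OneHotEncoder(sparse_output=False)
    X_train = ohe.fit_transform(df1)
    
    df2 = pd.DataFrame({'country': ['USA'], 'language': ['EN']})
    # Right: reuse the encoder fitted on the training data, transform only
    X_test = ohe.transform(df2)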
    
    # X_train
    # array([[0., 1., 1., 0.],
    #        [1., 0., 0., 1.]])
    
    # X_test
    # array([[0., 1., 1., 0.]])  # same shape as X_train
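
Applied to the workflow in the question, the same idea might look like the sketch below. The column names, the `label` target column, and the data are hypothetical, since the question does not show the real DataFrames. It assumes the target lives in its own column of the training data, keeps `handle_unknown='ignore'` from the question so that a category unseen during fitting is encoded as all zeros instead of raising an error, and drops the LabelEncoder step because OneHotEncoder accepts string columns directly.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data: two categorical features plus a target column.
train = pd.DataFrame({
    "country": ["USA", "France", "USA", "France"],
    "language": ["EN", "FR", "EN", "FR"],
    "label": [1, 0, 1, 0],
})
# Hypothetical single test row; "ES" never appears in the training data.
test = pd.DataFrame({"country": ["USA"], "language": ["ES"]})

feature_cols = ["country", "language"]

# Fit the encoder once, on the training features only.
ohe = OneHotEncoder(handle_unknown="ignore")
X_train = ohe.fit_transform(train[feature_cols])
y_train = train["label"]

# Reuse the fitted encoder on the test row: transform only, no second fit.
X_test = ohe.transform(test[feature_cols])

# X_train and X_test now have the same number of columns.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
prediction = rf.predict(X_test)
```

Because the encoder is fitted only once, `X_test` has one column per category seen in training, and the unseen language simply contributes zeros, so `rf.predict` receives a row of the width it was trained on.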