machine-learning scikit-learn naivebayes

How to match test columns with train data?


I get an error when I try to predict on a single new row with naive Bayes.

from sklearn.naive_bayes import GaussianNB
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/sjwhitworth/golearn/master/examples/datasets/tennis.csv')

X_train = pd.get_dummies(df[['outlook', 'temp', 'humidity', 'windy']])
y_train = df['play']

gNB = GaussianNB()
gNB.fit(X_train, y_train)

ndf=pd.DataFrame({'outlook':['sunny'], 'temp':['hot'], 'humidity':['normal'], 'windy':[False]})
X_test=pd.get_dummies(ndf[['outlook', 'temp', 'humidity', 'windy']])

gNB.predict(X_test)

ValueError: operands could not be broadcast together with shapes (1,4) (9,)

Is it a good idea to use the get_dummies method in this case?


Solution

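  • As vivek pointed out, this is not good practice, but here is the code if you want to do it anyway. The error occurs because get_dummies on the single test row produces only 4 columns, while the model was trained on 9: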

    from sklearn.naive_bayes import GaussianNB
    import pandas as pd
    df = pd.read_csv('https://raw.githubusercontent.com/sjwhitworth/golearn/master/examples/datasets/tennis.csv')
    
    X_train = pd.get_dummies(df[['outlook', 'temp', 'humidity', 'windy']])
    y_train = df['play']
    
    gNB = GaussianNB()
    gNB.fit(X_train, y_train)
    
    ndf=pd.DataFrame({'outlook':['sunny'], 'temp':['hot'], 'humidity':['normal'], 'windy':[False]})
    X_test=pd.get_dummies(ndf[['outlook', 'temp', 'humidity', 'windy']])
    
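    # Align the test columns with the training columns, in the same order.
    # Dummy columns the single test row did not produce are filled with 0;
    # values already present (such as windy=False) are kept, not forced to 1.
    X_test_new = X_test.reindex(columns=X_train.columns, fill_value=0).astype(int)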
    
    
    gNB.predict(X_test_new)
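
A possible alternative, sketched under assumptions rather than taken from the answer above: fit scikit-learn's OneHotEncoder once on the training frame and reuse the same fitted encoder for new rows, so the encoded test row always has the training column layout. The toy rows below are made up for illustration and are not the tennis.csv data; the names train, encoder, X_train_enc and X_test_enc are hypothetical.

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import OneHotEncoder

# Made-up toy rows for illustration only (NOT the tennis.csv data).
train = pd.DataFrame({
    'outlook':  ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'overcast'],
    'temp':     ['hot', 'mild', 'hot', 'mild', 'cool', 'cool'],
    'humidity': ['high', 'normal', 'high', 'high', 'normal', 'normal'],
    'windy':    [False, True, False, False, True, True],
    'play':     ['no', 'yes', 'yes', 'yes', 'no', 'yes'],
})
features = ['outlook', 'temp', 'humidity', 'windy']

# Learn the category set from the training data once.
# handle_unknown='ignore' encodes an unseen category as all zeros
# instead of raising an error at prediction time.
encoder = OneHotEncoder(handle_unknown='ignore')

# The encoder returns a sparse matrix by default; GaussianNB needs a
# dense array, hence toarray().
X_train_enc = encoder.fit_transform(train[features]).toarray()

gNB = GaussianNB()
gNB.fit(X_train_enc, train['play'])

# The new row is encoded with the SAME fitted encoder, so it gets one
# column per training category even though it contains a single value
# per feature.
ndf = pd.DataFrame({'outlook': ['sunny'], 'temp': ['hot'],
                    'humidity': ['normal'], 'windy': [False]})
X_test_enc = encoder.transform(ndf[features]).toarray()

prediction = gNB.predict(X_test_enc)
print(X_train_enc.shape, X_test_enc.shape, prediction)
```

The point of the design is that the column layout is decided at fit time and stored in the encoder, so no manual column matching is needed when a test row lacks some categories.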