I'm trying to discretize data for classification. The values were strings, and I converted them to the numbers 0, 1, 2, 3.
This is what the data looks like (pandas dataframe). I have split the dataframe into dataLabel and dataFeatures.
Label Feat1 Feat2 Feat3
0 0 3 0
1 1 1 2
2 0 2 2
3 1 3 3
I want to use scikit-learn's Decision Tree and Multinomial Naive Bayes, and I am trying to discretize the data using DictVectorizer.
This is what I have:
dictvec = dataFeatures.T.to_dict().values()
from sklearn.feature_extraction import DictVectorizer as DV
vectorizer = DV( sparse = False )
X = vectorizer.fit_transform(dictvec)
Y = dataLabel.ravel()
This is my input to the classifier:
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
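from sklearn import metrics
# in older scikit-learn versions: from sklearn.cross_validation import cross_val_score
from sklearn.model_selection import cross_val_score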
scores = cross_val_score(mnb, Y, X, cv=10, scoring='accuracy')
I get the error "bad input shape (64, 4)", but I am not sure whether that has to do with how the data is discretized.
My question is: is this the correct way to discretize the data? Is my code correct, or is there a better way to do it?
The error occurred because Y and X were passed in the wrong order. The call should be scores = cross_val_score(mnb, X, Y, cv=10, scoring='accuracy').
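For reference, here is a minimal sketch of the argument order that cross_val_score expects. The data is randomly generated as a stand-in for the real dataframe (an assumption, not the actual data), and the import path assumes a scikit-learn version that provides sklearn.model_selection:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Stand-in data (hypothetical): 64 rows, 3 features with values 0..3.
rng = np.random.RandomState(0)
X = rng.randint(0, 4, size=(64, 3))  # 2-D feature matrix
y = np.arange(64) % 4                # 1-D labels, 16 samples per class

mnb = MultinomialNB()

# Argument order is: estimator, features (2-D), labels (1-D).
scores = cross_val_score(mnb, X, y, cv=10, scoring="accuracy")
print(scores.shape)  # one accuracy value per fold
```

The estimator comes first, then the 2-D feature matrix, then the 1-D label array. Swapping the last two hands the 2-D array to the label slot, which is what the shape complaint is about.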
The code now runs correctly. After looking into different options, I found that OneHotEncoder was a better choice than DictVectorizer.
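To illustrate the difference between the two encoders, here is a rough sketch on made-up data: every combination of three features with values 0 to 3, with labels that are an arbitrary function of the features rather than the real ones. When the values are numbers, DictVectorizer passes them through unchanged (it only one-hot encodes string values), whereas OneHotEncoder creates one binary column per category. The pipeline at the end is one possible way to wire this up, not necessarily how the original code did it:

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up data: all 64 combinations of three features with values 0..3.
idx = np.arange(64)
X = np.column_stack([idx % 4, (idx // 4) % 4, (idx // 16) % 4])
y = X.sum(axis=1) % 4  # arbitrary labels, 16 samples per class

# DictVectorizer keeps numeric values as they are: still 3 columns.
records = [dict(zip(["Feat1", "Feat2", "Feat3"], row)) for row in X.tolist()]
X_dv = DictVectorizer(sparse=False).fit_transform(records)
print(X_dv.shape)   # (64, 3)

# OneHotEncoder makes one binary column per (feature, value) pair.
X_ohe = OneHotEncoder().fit_transform(X)
print(X_ohe.shape)  # (64, 12)

# Putting the encoder in a pipeline refits it on each training fold.
model = make_pipeline(OneHotEncoder(handle_unknown="ignore"), MultinomialNB())
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print(scores.mean())
```

With the dataframe from the question, dataFeatures and the flattened dataLabel would take the place of X and y.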