Tags: machine-learning, scikit-learn, classification, categorical-data, sklearn-pandas

scikit learn discretizing categorical numeric data


I'm trying to discretize data for classification. The values were strings, and I converted them to the numbers 0, 1, 2, 3.

This is what the data looks like (a pandas DataFrame). I have split the DataFrame into dataLabel and dataFeatures.

Label  Feat1  Feat2  Feat3
  0      0      3      0
  1      1      1      2
  2      0      2      2
  3      1      3      3

I want to use scikit-learn's Decision Tree and Multinomial Naive Bayes classifiers, and I am trying to discretize the data using DictVectorizer.

This is what I have

dictvec = dataFeatures.T.to_dict().values()

from sklearn.feature_extraction import DictVectorizer as DV
vectorizer = DV( sparse = False )
X = vectorizer.fit_transform(dictvec)

Y = dataLabel.ravel()

This is my input to classifier

from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()

from sklearn import metrics
from sklearn.model_selection import cross_val_score
scores = cross_val_score(mnb, Y, X, cv=10, scoring='accuracy')

I get the error "bad input shape (64, 4)", but I am not sure whether that has to do with how the data is discretized.

My question is: is this the correct way to discretize the data? Is my code correct, or is there a better way to do it?


Solution

  • The error was that Y and X were in the wrong order. The call should be scores = cross_val_score(mnb, X, Y, cv=10, scoring='accuracy'). The code now runs correctly. After looking into different options, I found that OneHotEncoder worked better than DictVectorizer.
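
For reference, here is a minimal sketch of what the OneHotEncoder approach might look like. It is not the exact code from the question: the data below is a randomly generated stand-in (64 rows, three features coded 0 to 3), and wrapping the encoder and classifier in a pipeline is my own choice so that the encoding is fitted inside each cross-validation fold. Two points it illustrates: cross_val_score expects the feature matrix first and the labels second (passing the 2-D feature matrix as the labels is what produces "bad input shape"), and DictVectorizer only one-hot encodes string values, so integer-coded categories pass through as plain numbers unless something like OneHotEncoder expands them.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the question's data: 64 rows,
# three categorical features already coded as integers 0..3.
rng = np.random.default_rng(0)
dataFeatures = pd.DataFrame(
    rng.integers(0, 4, size=(64, 3)),
    columns=["Feat1", "Feat2", "Feat3"],
)
# Balanced labels (16 rows per class) so that 10-fold stratified CV is possible.
dataLabel = pd.Series(np.tile(np.arange(4), 16), name="Label")

# One-hot encode each integer-coded category, then fit Multinomial Naive Bayes.
# handle_unknown="ignore" avoids errors if a fold's test split contains a
# category value that was absent from its training split.
model = make_pipeline(OneHotEncoder(handle_unknown="ignore"), MultinomialNB())

# Features first, labels second.
scores = cross_val_score(model, dataFeatures, dataLabel, cv=10, scoring="accuracy")
print(scores.mean())
```

Since the stand-in data is random, the accuracy itself is meaningless here; only the shape of the call matters. The same pipeline should work with DecisionTreeClassifier in place of MultinomialNB.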