Tags: python, scikit-learn, naive-bayes

How should I reformat my data for sklearn.naive_bayes.GaussianNB


I have a dataset of users. Each user has a gender, a favorite color, and so on. For each color, I counted the users of one gender who like that color, and put each color/count pair into one list:

features_train = [['indigo', 2341], ['yellow', 856], ['lavender', 690], ['yellowgreen', 1208], ['indigo', 565], ['yellow', 103], ['lavender', 571], ['yellowgreen', 234] ...]

In a second list, I record for each element of the first list which gender that element represents:

labels_train = [0, 0, 0, 0, 1, 1, 1, 1, ...]

And now I have a third list with colors, features_test = ['yellow', 'red', ...], and I need to predict the gender.

I have to use the naive_bayes.GaussianNB function from sklearn, and I will have more properties for the users, but to explain my problem I use just color and gender. I found an official example, but I can't understand how I should reformat my datasets to work with it. Should I convert my colors to some numeric representation, like [[0, 2341], [1, 856]], or should I use some other function from sklearn to do that?

import numpy as np
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(features_train, labels_train)
print(clf.predict(features_test))

Solution

  • In order to perform machine learning on text with scikit-learn, you first need to turn the text content into numerical feature vectors.

    The most intuitive way to do so is the bag-of-words representation, and you can indeed get there by reformatting your dataset as you suggested.
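    As an aside, if you ever have free text rather than a single color label per user, scikit-learn's CountVectorizer builds that bag-of-words matrix directly. A minimal sketch with made-up toy documents:

    ```python
    from sklearn.feature_extraction.text import CountVectorizer

    # Two made-up toy documents
    docs = ["indigo and yellow", "yellow and lavender"]

    vec = CountVectorizer()
    X = vec.fit_transform(docs)  # sparse matrix of per-document word counts

    print(vec.get_feature_names_out())  # learned vocabulary, sorted alphabetically
    print(X.toarray())                  # count of each vocabulary word per document
    ```

    Each row of the resulting matrix is one document; each column is one word from the learned vocabulary.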

    Given that your X values are effectively 1-D here (one color label per sample), I would recommend converting your text classes into numbers by using the LabelEncoder in scikit-learn.

    See below:

    import numpy as np
    from sklearn import preprocessing
    from sklearn.naive_bayes import GaussianNB

    clf = GaussianNB()
    le = preprocessing.LabelEncoder()

    # LabelEncoder expects 1-D input, so encode only the color column
    colors_train = [row[0] for row in features_train]

    # Fit the label encoder and return the encoded features
    features_train_num = le.fit_transform(colors_train)
    features_test_num  = le.transform(features_test)

    # GaussianNB expects X to be 2-D: (n_samples, n_features)
    features_train_num = features_train_num.reshape(-1, 1)
    features_test_num  = features_test_num.reshape(-1, 1)

    # labels_train is already numeric (0 or 1), so it needs no encoding
    clf.fit(features_train_num, labels_train)
    print(clf.predict(features_test_num))
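    Putting it together, here is a self-contained run on an adapted, trimmed-down version of the question's data (made up so that the two genders prefer different colors, and with the count column simply carried along but ignored by the encoder in this sketch):

    ```python
    from sklearn import preprocessing
    from sklearn.naive_bayes import GaussianNB

    # Made-up [color, count] rows, loosely modeled on the question's data
    features_train = [['indigo', 2341], ['indigo', 856], ['lavender', 690],
                      ['yellow', 565], ['yellow', 103], ['lavender', 571]]
    labels_train = [0, 0, 0, 1, 1, 1]
    features_test = ['yellow', 'indigo']

    # Encode only the color column
    le = preprocessing.LabelEncoder()
    colors_train = [row[0] for row in features_train]

    # reshape(-1, 1) turns the 1-D encoded colors into the 2-D X that GaussianNB expects
    X_train = le.fit_transform(colors_train).reshape(-1, 1)
    X_test = le.transform(features_test).reshape(-1, 1)

    clf = GaussianNB()
    clf.fit(X_train, labels_train)
    print(clf.predict(X_test))  # one predicted gender label per test color
    ```

    With this toy data, 'yellow' is predicted as gender 1 and 'indigo' as gender 0, since each color sits closer to the mean of the encoded colors seen for that gender.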