I have a dataset of users. Each user has a gender, a favorite color, and so on. For each color I put the color and the number of users of one gender who like it into one list:
features_train = [['indigo', 2341], ['yellow', 856], ['lavender', 690], ['yellowgreen', 1208], ['indigo', 565], ['yellow', 103], ['lavender', 571], ['yellowgreen', 234] ...]
In a second list I record, for each element of the first list, which gender it represents:
labels_train = [0, 0, 0, 0, 1, 1, 1, 1, ...]
And now I have a third list with colors, features_test = ['yellow', 'red', ...], for which I need to predict the gender.
I have to use the naive_bayes.GaussianNB classifier from sklearn, and I will have more properties for users, but to explain my problem I use just color and gender. I found the official example, but I can't understand how I should reformat my datasets to work with it. Should I convert my colors to some number representation, like [[0, 2341], [1, 856]], or is there another function in sklearn that does this for me?
import numpy as np
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(features_train, labels_train)
print(clf.predict(features_test))
In order to perform machine learning on text with scikit-learn, you first need to turn the text content into numerical feature vectors.
The most intuitive way to do so is the bag-of-words representation - you can get there by reformatting your dataset as you suggested.
If your feature is just the color (so both your 'X' and 'y' are 1-D), I would recommend converting the text classes into numbers with the LabelEncoder in scikit-learn.
See below:
import numpy as np
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
le = preprocessing.LabelEncoder()

# Fit the label encoder on the training colors and encode both sets
features_train_num = le.fit_transform(features_train)
features_test_num = le.transform(features_test)

# GaussianNB expects a 2-D feature matrix of shape (n_samples, n_features),
# so reshape the 1-D encoded arrays; your labels are already numeric
clf.fit(features_train_num.reshape(-1, 1), labels_train)
print(clf.predict(features_test_num.reshape(-1, 1)))
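If you later want to keep both the color and the count as features (as in the [['indigo', 2341], ...] lists from your question), you can encode just the color column and stack it next to the numeric column. A minimal sketch, using made-up test counts since your features_test only lists colors:

```python
import numpy as np
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB

features_train = [['indigo', 2341], ['yellow', 856], ['lavender', 690],
                  ['yellowgreen', 1208], ['indigo', 565], ['yellow', 103],
                  ['lavender', 571], ['yellowgreen', 234]]
labels_train = [0, 0, 0, 0, 1, 1, 1, 1]
features_test = [['yellow', 400], ['lavender', 600]]  # hypothetical counts

# Encode only the color column; the count column is already numeric
le = preprocessing.LabelEncoder()
colors_train = le.fit_transform([row[0] for row in features_train])
colors_test = le.transform([row[0] for row in features_test])

# Stack encoded colors next to the counts -> shape (n_samples, 2)
X_train = np.column_stack([colors_train, [row[1] for row in features_train]])
X_test = np.column_stack([colors_test, [row[1] for row in features_test]])

clf = GaussianNB()
clf.fit(X_train, labels_train)
print(clf.predict(X_test))
```

Note that integer codes impose an artificial ordering on colors; once you have more than a handful of categories, a one-hot encoding (e.g. sklearn's OneHotEncoder) is usually a better fit for a Gaussian model.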