Tags: python, numpy, machine-learning, scikit-learn, supervised-learning

How to implement Cross Validation and Random Forest Classifier given feature sets as dictionaries?


My featuresets are stored as a sequence of (feature dictionary, label) pairs, each of the form:

({0: 0.48447204968944096, 
  1: 0.035093167701863354, 
  2: 0.07453416149068323, 
  3: 0.046583850931677016, 
  4: 0.0, 
  5: 0.09316770186335403,
  ...
  162: 1, 
  163: 1.0}, 'male')

When I try to use cross_val_score or cross_val_predict from the sklearn library, it always fails with an error saying

"float values cannot be dict".

Could someone please help me implement cross-validation using Linear SVC and a Random Forest classifier in Python?

This is what I tried:

train_set, test_set = featuresets[1:1628], featuresets[1630:3257]
np.asarray(train_set)
np.asarray(test_set)
clf = SVC(kernel='linear', C=5)
predicted = cross_val_predict(clf, train_set, test_set, cv=10)
metrics.accuracy_score(test_set, predicted)

Also, I don't understand how to implement k-fold cross-validation here.


Solution

  • Let us first import the necessary modules:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    

    You have to create an instance of a random forest classifier like this:

    clf = RandomForestClassifier()
    

    Then you need to load featuresets (I don't have this data so I couldn't test my code) and convert the categorical label into a numerical one, for example through a dictionary:

    featuresets = ...  # your code here
    gender = {'male': 0, 'female': 1}
    

    The next step is to store the features and labels as NumPy arrays:

    X = np.asarray([[i[1] for i in sorted(d.items())] for d, _ in featuresets])
    y = np.asarray([gender[s] for _, s in featuresets])
    

    Now you are ready to estimate the accuracy of a random forest classifier on your dataset by splitting the data, fitting a model and computing the score 10 consecutive times (with different splits each time):

    scores = cross_val_score(clf, X, y, cv=10)
    print('Scores =', scores)
    

    If you run the snippets above you should see an array of 10 scores printed.
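
The question also asks about Linear SVC and an explicit k-fold split, so here is a minimal end-to-end sketch of the same steps. Note that cross_val_predict expects the features X and the labels y of the same samples, not a train set and a test set as in the attempt above. The data below is synthetic and only stands in for the real featuresets; the names keys, classifiers and results, as well as the hyperparameters (n_estimators=50, max_iter=10000, the random seeds), are illustrative assumptions rather than anything taken from the original post.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for the real data (an assumption): 200 samples, each a
# (feature dict, label) pair with integer keys 0..9, shaped like the question's.
rng = np.random.default_rng(0)
labels = ['male', 'female'] * 100
featuresets = [
    ({k: float(rng.random() + (0.5 if label == 'male' and k < 3 else 0.0))
      for k in range(10)}, label)
    for label in labels
]

gender = {'male': 0, 'female': 1}

# Use one explicit, shared key order so the columns line up across samples.
# This assumes every feature dict has the same keys as the first one.
keys = sorted(featuresets[0][0])
X = np.asarray([[d[k] for k in keys] for d, _ in featuresets])
y = np.asarray([gender[s] for _, s in featuresets])

# Explicit 10-fold splitter; shuffling guards against data sorted by label.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

classifiers = {
    'random forest': RandomForestClassifier(n_estimators=50, random_state=0),
    'linear SVC': LinearSVC(C=5, max_iter=10000),
}

results = {}
for name, clf in classifiers.items():
    # One accuracy score per fold.
    scores = cross_val_score(clf, X, y, cv=cv)
    # One out-of-fold prediction per sample, compared against the true labels.
    predicted = cross_val_predict(clf, X, y, cv=cv)
    results[name] = (scores, accuracy_score(y, predicted))
    print(f'{name}: mean fold accuracy = {scores.mean():.3f}, '
          f'out-of-fold accuracy = {results[name][1]:.3f}')
```

Passing the splitter object as cv makes the folds explicit and reproducible; passing cv=10 instead lets scikit-learn choose a stratified 10-fold split for classifiers without shuffling.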