scikit-learn, artificial-intelligence, random-forest

Intel daal4py classifiers with scikit-learn


I am testing the sklearn-compatible wrappers in the latest version of Intel daal4py. The Intel k-nearest classifier works fine with sklearn’s cross_val_score() and GridSearchCV. It gives a significant performance boost, and the Intel and sklearn models produce generally comparable results across 10 large public datasets and some simulated datasets. The sklearn-compatible wrapper for the Intel random forest classifier, however, seems to be completely broken: the score() method fails, so I cannot proceed any further with that wrapper class.

I posted this on the Intel AI Developer Forum, but I was wondering whether anyone here has gotten the Intel sklearn-compatible random forest classifier to work.

My next step is to test the native daal4py random forest object and possibly write my own wrapper, because the native daal4py API is so different from sklearn's. I was hoping to avoid this. There also seems to be some confusion on the Intel site regarding the names of the wrapper classes.

I am using:

  • For k-nearest: daal4py.sklearn.neighbors.kdtree_knn_classifier (this works fine)
  • For random forest: daal4py.sklearn.ensemble.decision_forest.RandomForestClassifier

The failure in the Intel RandomForestClassifier occurs in forest.py, because n_classes_ is an int but the code below indexes it as n_classes_[k]. The value of n_classes_ does match the number of classes in the label variable that I pass in, and the labels are integers.

predictions = [np.zeros((n_samples, n_classes_[k]))
                for k in range(self.n_outputs_)]
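
To make the failure concrete: the comprehension above indexes n_classes_ once per output, which only works when n_classes_ is a sequence. The sketch below reproduces the error with plain Python lists standing in for the NumPy arrays. The helper `as_per_output_list` is hypothetical and only for illustration; it is not part of daal4py or scikit-learn.

```python
import numbers


def allocate_predictions(n_samples, n_classes_, n_outputs_):
    """Mirror the forest.py snippet: one (n_samples x n_classes) block per output."""
    return [[[0.0] * n_classes_[k] for _ in range(n_samples)]
            for k in range(n_outputs_)]


def as_per_output_list(n_classes_, n_outputs_):
    """Hypothetical helper: wrap a bare integer so it can be indexed per output."""
    if isinstance(n_classes_, numbers.Integral):
        return [int(n_classes_)] * n_outputs_
    return list(n_classes_)


# An integer n_classes_ reproduces the reported failure.
try:
    allocate_predictions(n_samples=4, n_classes_=3, n_outputs_=1)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# The same call succeeds once n_classes_ is a list with one entry per output.
blocks = allocate_predictions(4, as_per_output_list(3, 1), 1)
print(len(blocks), len(blocks[0]), len(blocks[0][0]))  # → 1 4 3
```

This only illustrates why an integer n_classes_ breaks the indexing; whether the wrapper or the calling code should change is a separate question.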

Solution

  • Below are the steps we used to compute scores for the daal4py RandomForestClassifier.

    (i) For cross_val_score

    from daal4py.sklearn.ensemble.decision_forest import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # train_data: feature matrix; train_labels: 1-D array of class labels
    clf = RandomForestClassifier()
    scores = cross_val_score(clf, train_data, train_labels, cv=3)  # one score per fold
    print(scores)
    

    (ii) For GridSearchCV

    from sklearn.model_selection import GridSearchCV
    from daal4py.sklearn.ensemble.decision_forest import RandomForestClassifier

    # Hyperparameter values to try; every combination is evaluated
    param_grid = {
        'n_estimators': [200, 700],
        'max_features': ['auto', 'sqrt', 'log2']
    }
    clf = RandomForestClassifier()
    CV_rfc = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5)
    CV_rfc.fit(train_data, train_labels)
    # Score of the refitted best estimator (here evaluated on the training data)
    score = CV_rfc.score(train_data, train_labels)
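
As a rough guide to the cost of the grid search above, the number of model fits can be counted without running it. This sketch uses only the standard library and assumes GridSearchCV's default refit=True, which adds one final fit on the full training data.

```python
from itertools import product

param_grid = {
    'n_estimators': [200, 700],
    'max_features': ['auto', 'sqrt', 'log2'],
}
cv = 5  # number of folds, as in the GridSearchCV call above

# Every combination of the listed values is one candidate configuration.
names = list(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[n] for n in names))]

n_candidates = len(candidates)   # 2 * 3 = 6
n_cv_fits = n_candidates * cv    # 6 * 5 = 30
n_total_fits = n_cv_fits + 1     # plus the final refit of the best candidate

print(n_candidates, n_cv_fits, n_total_fits)  # → 6 30 31
```

Since half of the candidates train 700 trees, a smaller grid or fewer folds may be a sensible first run on large datasets.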