python, pandas, machine-learning, scikit-learn, nan

Classifiers in scikit-learn that handle NaN/null


I was wondering if there are classifiers in scikit-learn that handle NaN/null values. I thought RandomForestRegressor did, but I got an error when I called predict.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

X_train = np.array([[1, np.nan, 3], [np.nan, 5, 6]])
y_train = np.array([1, 2])
clf = RandomForestRegressor().fit(X_train, y_train)
X_test = np.array([[7, 8, np.nan]])  # predict expects a 2D array
y_pred = clf.predict(X_test) # Fails!

Can I not call predict with any scikit-learn algorithm with missing values?

Edit: Now that I think about it, this makes sense. It's not an issue during training, but when you predict, how do you branch when the variable is null? Maybe you could split both ways and average the result? It seems like k-NN should work fine as long as the distance function ignores nulls, though.
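The "distance function that ignores nulls" idea actually exists in scikit-learn as `nan_euclidean_distances`: each pairwise distance is computed only over the coordinates present in both rows, then scaled up to compensate for the ignored ones. A quick sketch on the toy arrays from the question:

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

X_train = np.array([[1.0, np.nan, 3.0], [np.nan, 5.0, 6.0]])
X_test = np.array([[7.0, 8.0, np.nan]])

# Each distance uses only the coordinates present in both rows,
# weighted by n_features / n_present to compensate for the rest.
print(nan_euclidean_distances(X_test, X_train))
```

This is also the distance used internally by `sklearn.impute.KNNImputer`, so the k-NN intuition above is essentially how that imputer works.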

Edit 2 (older and wiser me): some GBM libraries (such as xgboost) use a ternary tree instead of a binary tree precisely for this purpose: two children for the yes/no decision and one child for the missing decision. sklearn uses a binary tree.


Solution

  • I made an example with missing values in both the training and the test sets.

    I just picked one strategy: replacing missing data with the column mean, using the SimpleImputer class. There are other strategies.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.impute import SimpleImputer
    
    
    X_train = [[0, 0, np.nan], [np.nan, 1, 1]]
    Y_train = [0, 1]
    X_test_1 = [0, 0, np.nan]
    X_test_2 = [0, np.nan, np.nan]
    X_test_3 = [np.nan, 1, 1]
    
    # Create our imputer to replace missing values with the column mean
    imp = SimpleImputer(missing_values=np.nan, strategy='mean')
    imp = imp.fit(X_train)
    
    # Impute our data, then train
    X_train_imp = imp.transform(X_train)
    clf = RandomForestClassifier(n_estimators=10)
    clf = clf.fit(X_train_imp, Y_train)
    
    for X_test in [X_test_1, X_test_2, X_test_3]:
        # Impute each test item (transform expects a 2D array), then predict
        X_test_imp = imp.transform([X_test])
        print(X_test, '->', clf.predict(X_test_imp))
    
    # Example results (no random_state is set, so output may vary)
    [0, 0, nan] -> [0]
    [0, nan, nan] -> [0]
    [nan, 1, 1] -> [1]
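The "other strategies" mentioned above include, for example, `'median'` and `'most_frequent'` for SimpleImputer, as well as a k-NN-based imputer that fills each gap from the nearest complete rows. A quick sketch on similar toy data:

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = [[0, 0, np.nan], [np.nan, 1, 1], [0, 1, 0]]

# Column median instead of column mean
print(SimpleImputer(strategy='median').fit_transform(X))

# Fill each gap from the nearest row(s), measured with a
# NaN-aware Euclidean distance
print(KNNImputer(n_neighbors=1).fit_transform(X))
```

Which strategy is appropriate depends on the data; the mean is sensitive to outliers, while KNNImputer preserves more of the local structure at a higher computational cost.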