Tags: python-3.x, scikit-learn, classification, sklearn-pandas

What to pass to clf.predict()?


I started playing with Decision Trees lately and wanted to train my own simple model on some manufactured data. I wanted to use this model to predict some further mock data, just to get a feel for how it works, but then I got stuck. Once your model is trained, how do you pass data to predict()?

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

Docs state: clf.predict(X)

Parameters: X : array-like or sparse matrix of shape = [n_samples, n_features]

But when trying to pass np.array, np.ndarray, list, tuple or DataFrame it just throws an error. Can you help me understand why please?

Code below:

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

import graphviz
import pandas as pd
import numpy as np
import random
from sklearn import tree

pd.options.display.max_seq_items=5000
pd.options.display.max_rows=20
pd.options.display.max_columns=150

length = 50000

miles_commuting = [random.choice([2,3,4,5,7,10,20,25,30]) for x in range(length)]
salary = [random.choice([1300,1600,1800,1900,2300,2500,2700,3300,4000]) for x in range(length)]
full_time = [random.choice([1,0,1,1,0,1]) for x in range(length)]

DataFrame = pd.DataFrame({'CommuteInMiles':miles_commuting,'Salary':salary,'FullTimeEmployee':full_time})

DataFrame['Moving'] = np.where((DataFrame.CommuteInMiles > 20) & (DataFrame.Salary > 2000) & (DataFrame.FullTimeEmployee == 1),1,0)
DataFrame['TargetLabel'] = np.where((DataFrame.Moving == 1),'Considering move','Not moving')

target = DataFrame.loc[:,'Moving']
data = DataFrame.loc[:,['CommuteInMiles','Salary','FullTimeEmployee']]
target_names = DataFrame.TargetLabel
features = data.columns.values

clf = tree.DecisionTreeClassifier()
clf = clf.fit(data, target)

clf.predict(?????) #### <===== What should go here?

clf.predict([30,4000,1])

ValueError: Expected 2D array, got 1D array instead: array=[3.e+01 4.e+03 1.e+00]. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

clf.predict(np.array(30,4000,1))

ValueError: only 2 non-keyword arguments accepted


Solution

  • Where is your "mock data" that you want to predict?

    The data you pass to predict() must have the same number of features (columns), in the same order, as the data you passed to fit(). From the code above, I see that your X has three columns: ['CommuteInMiles','Salary','FullTimeEmployee']. Your prediction data needs exactly those three columns; the number of rows can be arbitrary.

    Now when you do

    clf.predict([30,4000,1])
    

    The model cannot tell whether these three values are the features of a single row or one feature from three different rows.

    So you need to convert that into a 2-D array, where each inner array represents a single row.

    Do this:

    clf.predict([[30,4000,1]])     #<== Observe the two square brackets
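
The two reshape hints in the error message can be compared directly. This is a minimal sketch using NumPy only; the variable names are illustrative and not from the question:

```python
import numpy as np

sample = np.array([30, 4000, 1])      # 1-D: shape (3,), ambiguous for predict()

as_row = sample.reshape(1, -1)        # one sample with three features
as_column = sample.reshape(-1, 1)     # three samples with one feature each

print(sample.shape, as_row.shape, as_column.shape)
```

For the model in the question, which was fitted on three features, as_row is the layout predict() expects.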
    

    You can predict multiple rows at once, each as its own inner list. Something like this:

    X_test = [[30,4000,1],
              [35,15000,0],
              [40,2000,1],]
    clf.predict(X_test)
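
Since the question also mentions passing a DataFrame, here is a rough end-to-end sketch of that variant. It assumes a small deterministic grid of training rows in place of the question's random data (the value lists and the labelling rule are copied from the question; the names train, columns and X_new are made up for illustration):

```python
from itertools import product

import pandas as pd
from sklearn import tree

# Assumption: every combination of the question's values, instead of random draws.
commutes = [2, 3, 4, 5, 7, 10, 20, 25, 30]
salaries = [1300, 1600, 1800, 1900, 2300, 2500, 2700, 3300, 4000]
columns = ['CommuteInMiles', 'Salary', 'FullTimeEmployee']

train = pd.DataFrame(list(product(commutes, salaries, [0, 1])), columns=columns)
# Same labelling rule as in the question.
train['Moving'] = ((train.CommuteInMiles > 20)
                   & (train.Salary > 2000)
                   & (train.FullTimeEmployee == 1)).astype(int)

clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(train[columns], train['Moving'])

# One row per sample, with the same columns in the same order as in fit().
X_new = pd.DataFrame([[30, 4000, 1],
                      [30, 4000, 0],
                      [25, 1900, 1]], columns=columns)
predictions = clf.predict(X_new)   # one 0/1 label per row
print(predictions)
```

Using the same column names for fitting and predicting should also avoid the feature-name warning that recent scikit-learn versions emit when a model fitted on a DataFrame is given a plain list.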
    

    As for your last error, clf.predict(np.array(30,4000,1)): this has nothing to do with predict(). You are calling np.array() incorrectly.

    According to the documentation, the signature of np.array is:

    (object, dtype=None, copy=True, order='K', subok=False, ndmin=0)
    

    As the error message says, only the first two parameters (object and dtype) can be passed positionally; the rest must be passed by keyword. When you write np.array(30,4000,1), the three values are treated as three separate positional arguments (30 for object, 4000 for dtype, and a third one that is not accepted) rather than as one sequence, hence the error. To make a NumPy array from a list, pass the list itself.

    Like this: np.array([30,4000,1]). Now the whole list is taken as the object parameter. Note that this array is still 1-D, so predict() needs it wrapped in another list or reshaped to 2-D, as described above.
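
A short sketch of the np.array() calls discussed here, using NumPy only. The variable names are illustrative, and the exact exception type and message for the faulty call depend on the NumPy version, so both likely types are caught:

```python
import numpy as np

one_d = np.array([30, 4000, 1])                 # one list as `object`: shape (3,)
two_d = np.array([[30, 4000, 1]])               # nested list: shape (1, 3)
also_two_d = np.array([30, 4000, 1], ndmin=2)   # ndmin passed by keyword: shape (1, 3)
as_float = np.array([30, 4000, 1], float)       # dtype may be passed positionally

try:
    np.array(30, 4000, 1)                       # three positional arguments
    raised = False
except (TypeError, ValueError) as exc:          # type varies across NumPy versions
    raised = True
    print(type(exc).__name__, exc)
```

Either two_d or also_two_d has the [n_samples, n_features] layout that predict() expects for a single sample.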