python-3.x scikit-learn classification sklearn-pandas

What to pass to clf.predict()?

I started playing with Decision Trees lately and I wanted to train my own simple model with some manufactured data. I wanted to use this model to predict some further mock data, just to get a feel of how it works, but then I got stuck. Once your model is trained, how do you pass data to predict()?

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

Docs state: clf.predict(X)

Parameters: X : array-like or sparse matrix of shape = [n_samples, n_features]

But when trying to pass np.array, np.ndarray, list, tuple or DataFrame it just throws an error. Can you help me understand why please?

Code below:

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

import graphviz
import pandas as pd
import numpy as np
import random
from sklearn import tree

pd.options.display.max_seq_items=5000
pd.options.display.max_rows=20
pd.options.display.max_columns=150

length = 50000

miles_commuting = [random.choice([2,3,4,5,7,10,20,25,30]) for x in range(length)]
salary = [random.choice([1300,1600,1800,1900,2300,2500,2700,3300,4000]) for x in range(length)]
full_time = [random.choice([1,0,1,1,0,1]) for x in range(length)]

DataFrame = pd.DataFrame({'CommuteInMiles':miles_commuting,'Salary':salary,'FullTimeEmployee':full_time})

DataFrame['Moving'] = np.where((DataFrame.CommuteInMiles > 20) & (DataFrame.Salary > 2000) & (DataFrame.FullTimeEmployee == 1),1,0)
DataFrame['TargetLabel'] = np.where((DataFrame.Moving == 1),'Considering move','Not moving')

target = DataFrame.loc[:,'Moving']
data = DataFrame.loc[:,['CommuteInMiles','Salary','FullTimeEmployee']]
target_names = DataFrame.TargetLabel
features = data.columns.values

clf = tree.DecisionTreeClassifier()
clf = clf.fit(data, target)

clf.predict(?????) #### <===== What should go here?

clf.predict([30,4000,1])

ValueError: Expected 2D array, got 1D array instead: array=[3.e+01 4.e+03 1.e+00]. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

clf.predict(np.array(30,4000,1))

ValueError: only 2 non-keyword arguments accepted

Solution

Where is your "mock data" that you want to predict?

Your prediction data must have the same number of columns (features), in the same order, as the data you passed to fit(). From the code above, your X has three columns: ['CommuteInMiles','Salary','FullTimeEmployee']. Your prediction data needs those same three columns; the number of rows can be arbitrary.
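As a rough illustration of the column requirement, the sketch below fits a tree on a tiny hand-made DataFrame and predicts on a second DataFrame with the same three columns. The rows are invented for this example (they are not the random data from the question), and `n_features_in_` is assumed to be available, which holds for scikit-learn 0.24 and later.

```python
import pandas as pd
from sklearn import tree

columns = ['CommuteInMiles', 'Salary', 'FullTimeEmployee']

# Tiny illustrative training set (made-up rows that follow the question's rule).
train = pd.DataFrame(
    [[30, 4000, 1], [25, 2300, 1], [30, 1300, 1],
     [5, 4000, 1], [30, 4000, 0], [2, 1300, 0]],
    columns=columns,
)
labels = [1, 1, 0, 0, 0, 0]

clf = tree.DecisionTreeClassifier(random_state=0).fit(train, labels)

# Prediction data: any number of rows, but the same three columns.
new_rows = pd.DataFrame([[30, 4000, 1], [2, 1300, 0]], columns=columns)
predictions = clf.predict(new_rows)

print(clf.n_features_in_)  # number of columns seen during fit
print(predictions.shape)   # one prediction per row
```

Since the model in the question was fitted on a DataFrame, predicting with a DataFrame that has the same column names also avoids the feature-name warning that recent scikit-learn versions emit when given a plain list.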

Now when you do

clf.predict([30,4000,1])

Here the model cannot tell whether the three values are three features of one row or one feature for each of three rows.

So you need to pass a 2D array in which each inner list represents a single row.

Do this:

clf.predict([[30,4000,1]])     #<== Observe the two square brackets

You can predict multiple rows at once, each as its own inner list. Something like this:

X_test = [[30,4000,1],
          [35,15000,0],
          [40,2000,1],]
clf.predict(X_test)
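To see the single-row and multi-row calls side by side, here is a self-contained sketch. It assumes a deterministic training set built from every combination of the values used in the question (rather than 50000 random draws), labelled with the question's own rule; the names `X_train`, `y_train`, `single` and `batch` are made up for this example.

```python
from itertools import product
from sklearn import tree

commutes = [2, 3, 4, 5, 7, 10, 20, 25, 30]
salaries = [1300, 1600, 1800, 1900, 2300, 2500, 2700, 3300, 4000]

# Every combination of the question's values, labelled with the question's rule.
X_train = [[c, s, f] for c, s, f in product(commutes, salaries, [0, 1])]
y_train = [int(c > 20 and s > 2000 and f == 1) for c, s, f in X_train]

clf = tree.DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# One row: still a list of lists, just with a single inner list.
single = clf.predict([[30, 4000, 1]])

# Several rows: one inner list per row, one prediction per row.
batch = clf.predict([[30, 4000, 1],
                     [35, 15000, 0],
                     [40, 2000, 1]])

print(single.shape, batch.shape)
```

predict() always returns one label per row, so `single` is an array of length 1 rather than a bare number; use `single[0]` to get the label itself.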

As for your last error, from clf.predict(np.array(30,4000,1)): it has nothing to do with predict(). You are calling np.array() incorrectly.

According to the documentation, the signature of np.array is:

(object, dtype=None, copy=True, order='K', subok=False, ndmin=0)

Only the first two parameters (object and dtype) can be passed positionally, which is what "only 2 non-keyword arguments accepted" means; the rest must be passed as keyword arguments. With np.array(30,4000,1) you pass three positional values, so NumPy rejects the call. Even np.array(30,4000) would not do what you want, because 30 would be taken as object and 4000 as dtype. To make a NumPy array from several values, pass them as a single list.

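Like this: np.array([30,4000,1]). The whole list is now the object argument. Note that this is still a 1D array, so for predict() you would need np.array([[30,4000,1]]) or np.array([30,4000,1]).reshape(1, -1).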
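The difference between the 1D and 2D forms is easiest to see from the array shapes. A minimal sketch (the variable names are only for illustration):

```python
import numpy as np

sample = np.array([30, 4000, 1])      # 1D: three values, no notion of rows
row = sample.reshape(1, -1)           # 2D: one row, three columns
same_row = np.array([[30, 4000, 1]])  # 2D built directly from a nested list

print(sample.shape, row.shape, same_row.shape)
# (3,) (1, 3) (1, 3)
```

Either of the two 2D forms can be passed to clf.predict(); the 1D form triggers the "Expected 2D array" error shown in the question.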