Tags: scikit-learn, python-3.7, decision-tree, one-hot-encoding

scikit-learn: how to predict new data if after one hot encoding it has fewer features than the training/testing sets


I'm trying to use scikit-learn in my first ever ML project, using its DecisionTreeClassifier with data samples containing both numeric and categorical features, e.g. ['High', 33, 'No', 4].

I'm at a point where I've been able to:

  1. Read the training and test data from a .csv file.

    physio = pd.read_csv('data.csv', header=None, names=['HR', 'M', 'T', 'W', 'D'])

  2. Extract the target class:
    labels = physio.pop('D')

  3. One-hot-encode the categorical features using pandas.get_dummies. This also increases the number of features from 4 to 6, since 'HR' and 'T' become 'HR_High'/'HR_Low' and 'T_Yes'/'T_No', respectively.

    for col in physio.dtypes[physio.dtypes == 'object'].index:
        for_dummy = physio.pop(col)
        physio = pd.concat([physio, pd.get_dummies(for_dummy, prefix=col)], axis=1)
    
  4. Split the set into train and test subsets.

    x_train, x_test, y_train, y_test = train_test_split(physio, labels, test_size=0.25)
    
  5. Instantiate and fit the tree

    dt = DecisionTreeClassifier(max_depth=8, min_samples_split=.3, min_samples_leaf=.26, max_features=4)
    dt.fit(x_train, y_train)
    
  6. Classify the test set obtained from the split in step 4

    y_pred = dt.predict(x_test)
    
  7. And use y_test and y_pred to evaluate the classification with a confusion matrix (and also the ROC AUC; a rough sketch of the AUC part follows right after this list)

    conf_matrix = confusion_matrix(y_true=y_test, y_pred=y_pred, labels=['Yes', 'No'])
    

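For reference, a minimal sketch of how the AUC can be computed with predict_proba and roc_auc_score (the exact code isn't shown above, so take this as an illustration only):

    from sklearn.metrics import roc_auc_score

    # probability of the 'Yes' class for every test sample
    yes_idx = list(dt.classes_).index('Yes')
    y_score = dt.predict_proba(x_test)[:, yes_idx]

    # roc_auc_score expects 0/1 labels, so map 'Yes'/'No' accordingly
    auc = roc_auc_score((y_test == 'Yes').astype(int), y_score)
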
I'm sorry if I use the wrong terminology, but now I'm trying to do what this is all supposed to be for: classifying incoming data. Unfortunately, this is where all the tutorials I've seen fall short; they never get to the part of the procedure where new data is classified, once all the splitting, training and testing has happened.

The way I naively tried it:

  1. Since the actual data will come from a command-line argument, I figured I'd store it in a list and pass that to a DataFrame.

    newSample = [['Low', 2, 'No', 8]]
    newSampleDF = pd.DataFrame(newSample, columns=['HR', 'M', 'T', 'W'])
    
  2. And then I tried to one-hot-encode it. This is where the problem arose: after the encoding is done, there are still only 4 features, because with just one data sample the encoder doesn't know anything about 'High' and 'Yes', so 'HR' and 'T' just become 'HR_Low' and 'T_No', respectively.

    for col in newSampleDF.dtypes[newSampleDF.dtypes == 'object'].index:
        for_dummy = newSampleDF.pop(col)
        newSampleDF = pd.concat([newSampleDF, pd.get_dummies(for_dummy, prefix=col)], axis=1)
    

When I print newSampleDF it shows:

M  W  HR_Low  T_No
2  8    1      1

while the data I'm trying to classify it against is of the form

 M   W  HR_High  HR_Low  T_No  T_Yes
12  48     0       1      0      1

which is why I get the error:

ValueError: Number of features of the model must match the input. Model n_features is 6 and input n_features is 4

which is self-explanatory; I just don't know how to solve it. How do I get my new data encoded in a way that is aware of the missing categories, 'High' and 'Yes' in this case?

I hope I made sense. Feel free to point out errors and improvements, but remember: first-timer here.

Thanks


Solution

  • I think the whole approach is a bit problematic. We don't need to use pd.get_dummies, since we already know the category columns. So why don't we use them directly?

    Thus I prefer the solution as follows:

    import pandas as pd

    # the categorical columns are known up front
    cats = ["HR", "T"]

    # one already-encoded row, shown only to illustrate the column layout the model expects
    training_data = pd.DataFrame([[12, 48, 0, 1, 0, 1]], columns=["M","W", "HR_High", "HR_Low", "T_No", "T_Yes"])

    raw_cols = ['HR', 'M', 'T', 'W']
    newSample = [['Low', 2, 'No', 8]]
    newSampleDF = pd.DataFrame(newSample, columns=raw_cols)

    # empty one-row frame that already has every encoded column
    cols = ["M","W", "HR_High", "HR_Low", "T_No", "T_Yes"]
    template = pd.DataFrame([6 * [None]], columns=cols)

    # copy the raw values over, mapping each categorical value to its dummy column
    for col in raw_cols:
        if col in cats:
            template.loc[0, col + "_" + newSampleDF.loc[0, col]] = 1
        else:
            template.loc[0, col] = newSampleDF.loc[0, col]

    # replace the NaNs (the dummy columns that were not set) with 0
    newSampleDF = template.fillna(0)
    print(newSampleDF)
    

    Out:

       M  W  HR_High  HR_Low  T_No  T_Yes
    0  2  8        0       1     1      0
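
    As a follow-up (not part of the original answer): once the new sample has the same six columns, it can be passed straight to the fitted tree with dt.predict(newSampleDF). A common alternative to building the template by hand is to run pd.get_dummies on the new sample exactly as in the question and then align it to the training columns with DataFrame.reindex, which fills the dummy columns the single sample couldn't produce with 0. A minimal sketch, assuming the dt classifier and the encoded x_train from the question's steps 4-5:

    import pandas as pd

    # new raw sample, same columns as in the question
    newSampleDF = pd.DataFrame([['Low', 2, 'No', 8]], columns=['HR', 'M', 'T', 'W'])

    # one-hot-encode it the same way the training data was encoded
    for col in newSampleDF.dtypes[newSampleDF.dtypes == 'object'].index:
        for_dummy = newSampleDF.pop(col)
        newSampleDF = pd.concat([newSampleDF, pd.get_dummies(for_dummy, prefix=col)], axis=1)

    # align to the columns the model was trained on; missing dummies become 0
    newSampleDF = newSampleDF.reindex(columns=x_train.columns, fill_value=0)

    print(dt.predict(newSampleDF))  # predicted label for the new sample

    For a more robust pipeline, scikit-learn's OneHotEncoder fitted on the training data (with handle_unknown='ignore') achieves the same column alignment without rebuilding the columns by hand.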