google-cloud-platform scikit-learn xgboost

XGboost Google-AI-Model expecting float values instead of using Categorical values and converting them

I'm trying to run a simple XGBoost Prediction based on Google Cloud using this simple example https://cloud.google.com/ml-engine/docs/scikit/getting-predictions-xgboost#get_online_predictions

The model is building fine, but when I try to run a prediction with a sample input JSON it fails with error "Could not initialize DMatrix from inputs: could not convert string to float:" as shown in the screen below. I understand this is happening because the test-input has strings, I was hoping the Google machine learning model should have information to convert the categorical values to floats. I cannot expect my user to submit-online-prediction-request with float values.

Based on the tutorial it should work without converting the categorical values to floats. Please advise, I have attached the GIF with more details. Thanks

import json
import numpy as np
import os
import pandas as pd
import pickle
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder

# these are the column labels from the census data files
COLUMNS = (
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per-week',
    'native-country',
    'income-level'
)

# categorical columns contain data that need to be turned into numerical
# values before being used by XGBoost
CATEGORICAL_COLUMNS = (
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country'
)

# load training set
with open('./census_data/adult.data', 'r') as train_data:
    raw_training_data = pd.read_csv(train_data, header=None, names=COLUMNS)
# remove column we are trying to predict ('income-level') from features list
train_features = raw_training_data.drop('income-level', axis=1)
# create training labels list
train_labels = (raw_training_data['income-level'] == ' >50K')


# load test set
with open('./census_data/adult.test', 'r') as test_data:
    raw_testing_data = pd.read_csv(test_data, names=COLUMNS, skiprows=1)
# remove column we are trying to predict ('income-level') from features list
test_features = raw_testing_data.drop('income-level', axis=1)
# create training labels list
test_labels = (raw_testing_data['income-level'] == ' >50K.')

# convert data in categorical columns to numerical values
encoders = {col:LabelEncoder() for col in CATEGORICAL_COLUMNS}
for col in CATEGORICAL_COLUMNS:
    train_features[col] = encoders[col].fit_transform(train_features[col])
for col in CATEGORICAL_COLUMNS:
    test_features[col] = encoders[col].fit_transform(test_features[col])

# load data into DMatrix object
dtrain = xgb.DMatrix(train_features, train_labels)
dtest = xgb.DMatrix(test_features)

# train XGBoost model
bst = xgb.train({}, dtrain, 20)
bst.save_model('./model.bst')

Solution

Here is a fix. Put the input shown in the Google documentation in a file input.json, then run this. The output is input_numerical.json and prediction will succeed if you use that in place of input.json.

This code is just preprocessing categorical columns to numerical forms using the same procedure as was done with training and test data.

import json

import pandas as pd
from sklearn.preprocessing import LabelEncoder

COLUMNS = (
    "age",
    "workclass",
    "fnlwgt",
    "education",
    "education-num",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "capital-gain",
    "capital-loss",
    "hours-per-week",
    "native-country",
    "income-level",
)

# categorical columns contain data that need to be turned into numerical
# values before being used by XGBoost
CATEGORICAL_COLUMNS = (
    "workclass",
    "education",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "native-country",
)

with open("./input.json", "r") as json_lines:
    rows = [json.loads(line) for line in json_lines]

prediction_features = pd.DataFrame(rows, columns=(COLUMNS[:-1]))

encoders = {col: LabelEncoder() for col in CATEGORICAL_COLUMNS}
for col in CATEGORICAL_COLUMNS:
    prediction_features[col] = encoders[col].fit_transform(prediction_features[col])

with open("input_numerical.json", "w") as input_numerical:
    for index, row in prediction_features.iterrows():
        input_numerical.write(row.to_json(orient="values") + "\n")

I created this Google Issues Tracker ticket as the Google documentation is missing this important step.