I'm in the early stages of learning SkFlow/TensorFlow, so I'll lay out my understanding of what I'm trying to do, incorrect as it may be.
Let's imagine I'm trying to build a model to predict if a car will fail an emissions test.
My training and testing csv might look something like this
make, fuel, year, mileage, days since service, passed test
vw, diesel, 2015, 10000, 20, 0
honda, petrol, 2008, 1000000, 234, 1
So the pass/fail column would be my y, the others being my x.
So far, with Baltimore's help on my previous SO question, I'm able to process the Iris dataset from a CSV file. That dataset is all numbers, however.
This example on the TensorFlow website shows a model built with census data, using categorical and continuous data. I'm trying to use SkFlow as I understand it simplifies the process.
Anyway, on to my code:
import tensorflow as tf
from numpy import genfromtxt

# Columns 0-4 are the features, column 5 is the pass/fail label
x_train = genfromtxt('/Users/ben/Desktop/data.csv', dtype=None, delimiter=',', usecols=(0, 1, 2, 3, 4))
y_train = genfromtxt('/Users/ben/Desktop/data.csv', dtype='int', delimiter=',', usecols=(5,))

feature_columns = [tf.contrib.layers.real_valued_column("", dimension=1)]

classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                            hidden_units=[10, 20, 10],
                                            n_classes=2,
                                            model_dir="./tmp/model1")

# Fit model. Add your train data here
classifier.fit(x=x_train, y=y_train, steps=2000)
So I've got my CSV data reading fine into my x_train and y_train objects. The CSV has no headers, but could have them if required.
I believe I need to define which columns hold which kind of data, something like:
make = tf.contrib.layers.sparse_column_with_hash_bucket("make", hash_bucket_size=1000)
fuel = tf.contrib.layers.sparse_column_with_keys(column_name="fuel", keys=["diesel", "petrol"])
How do I build the feature_columns object that gets passed into the classifier?
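For reference while reading answers, here's my rough, untested sketch of where I think this is heading, assuming a DNNClassifier (which, as I understand it, can't take sparse columns directly, so they get wrapped in embedding_column; the dimension values are arbitrary guesses):

import tensorflow as tf

# Categorical columns: hash buckets for the open-ended "make",
# explicit keys for the two known fuel types
make = tf.contrib.layers.sparse_column_with_hash_bucket("make", hash_bucket_size=1000)
fuel = tf.contrib.layers.sparse_column_with_keys(column_name="fuel", keys=["diesel", "petrol"])

# Continuous columns for the numeric fields
year = tf.contrib.layers.real_valued_column("year")
mileage = tf.contrib.layers.real_valued_column("mileage")
days_since_service = tf.contrib.layers.real_valued_column("days_since_service")

# A DNN works on dense tensors, so each sparse column is wrapped in an
# embedding; the dimensions here are placeholders, not tuned values
feature_columns = [
    tf.contrib.layers.embedding_column(make, dimension=8),
    tf.contrib.layers.embedding_column(fuel, dimension=2),
    year, mileage, days_since_service,
]

classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                            hidden_units=[10, 20, 10],
                                            n_classes=2)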
Here's my shot at it. The input_fn function creates a dict of tensors that is passed into the fit and evaluate methods via the wrappers below. That dict is used when the model is built: it defines the data, with the constant-value tensors holding the actual values. The feature column definitions describe those inputs to the estimator, and they are what's passed in during model fitting via the feature_columns argument: feature_columns=[gear, mpg, cyl, ...].
I left out all of the crossed-columns stuff, but it could be put in. I turned off WARNINGs, but if you want them, the switch is there. This also produces a surprising amount of log data, so be sure to check out the graphs with TensorBoard.
# an experiment with regression in Tensorflow using one categorical feature
# MTCARS - auto data. Is the car an Automatic or a Manual Shift?
# Data set location: https://vincentarelbundock.github.io/Rdatasets/datasets.html
# Below is a HIGHLY cut down version of the tensorflow wide tutorial at:
# https://www.tensorflow.org/tutorials/wide/
import tensorflow as tf
import numpy as np
import urllib.request
import tempfile
import pandas as pd
from sklearn.model_selection import train_test_split
LABEL_COLUMN = "label"
COLUMNS = ["mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"]
CONTINUOUS_COLUMNS = ["mpg","cyl","disp","hp","drat","wt","qsec","vs","carb"]
CATEGORICAL_COLUMNS = ["gear"]
# had to update the urllib stuff for 3.5.
# pull down csv file
# I am running on ubuntu 14.04, so I don't know how well the tempfile stuff will work on Windows.
# NamedTemporaryFile might have problems
data_file = tempfile.NamedTemporaryFile()
urllib.request.urlretrieve("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv", data_file.name)
cars = pd.read_csv(data_file, names=COLUMNS, skipinitialspace=True, skiprows=1)
# I want the "am" column as my label, so rename it - not really necessary,
# just trying to stay in sync with the wide tutorial
# am: 0 = Automatic 1 = Manual
cars.rename(columns={'am':LABEL_COLUMN}, inplace=True)
# turn gears into a categorical variable, again not really useful, but I want some categorical data
# turn the numbers into strings. I'm sure there is a one-liner somewhere that can do this...
cars['gear'] = cars['gear'].astype(str)
cars['gear'] = cars['gear'].replace({'3': 'THREE'}, regex=True)
cars['gear'] = cars['gear'].replace({'4': 'FOUR'}, regex=True)
cars['gear'] = cars['gear'].replace({'5': 'FIVE'}, regex=True)
# split into train and test sets - there is a woefully small number of rows here. Need a bigger data set.
train, test = train_test_split(cars, test_size=0.2)
# These methods are a copy of the input functions from the tensorflow wide tutorial updated for python 3.5
def input_fn(df):
    # Creates a dictionary mapping from each continuous feature column name (k)
    # to the values of that column stored in a constant Tensor.
    continuous_cols = {k: tf.constant(df[k].values)
                       for k in CONTINUOUS_COLUMNS}
    # Creates a dictionary mapping from each categorical feature column name (k)
    # to the values of that column stored in a tf.SparseTensor.
    categorical_cols = {k: tf.SparseTensor(
                            indices=[[i, 0] for i in range(df[k].size)],
                            values=df[k].values,
                            shape=[df[k].size, 1])
                        for k in CATEGORICAL_COLUMNS}
    # Merges the two dictionaries into one.
    # Old code (python 2): feature_cols = dict(continuous_cols.items() + categorical_cols.items())
    # New code for python 3.5:
    feature_cols = dict(continuous_cols)
    feature_cols.update(categorical_cols)
    # Converts the label column into a constant Tensor.
    label = tf.constant(df[LABEL_COLUMN].values)
    # Returns the feature columns and the label.
    return feature_cols, label

def train_input_fn():
    return input_fn(train)

def eval_input_fn():
    return input_fn(test)
# shut down WARNINGs
# You can adjust by using DEBUG, INFO, WARN, ERROR, or FATAL
tf.logging.set_verbosity(tf.logging.ERROR)
# set up the TF column for the categorical variable
gear = tf.contrib.layers.sparse_column_with_keys(column_name="gear", keys=["THREE", "FOUR", "FIVE"])
# if my categorical data had more than 10 keys, I would use:
#gear = tf.contrib.layers.sparse_column_with_hash_bucket("gear", hash_bucket_size=1000)
# set up the TF columns for the continuous variables
mpg = tf.contrib.layers.real_valued_column("mpg")
cyl = tf.contrib.layers.real_valued_column("cyl")
disp = tf.contrib.layers.real_valued_column("disp")
hp = tf.contrib.layers.real_valued_column("hp")
drat = tf.contrib.layers.real_valued_column("drat")
wt = tf.contrib.layers.real_valued_column("wt")
qsec = tf.contrib.layers.real_valued_column("qsec")
vs = tf.contrib.layers.real_valued_column("vs")
carb = tf.contrib.layers.real_valued_column("carb")
# Build the model. Make sure the logs dir already exists.
model_dir = "./logs"
m = tf.contrib.learn.LinearClassifier(
    feature_columns=[gear, mpg, cyl, disp, hp, drat, wt, qsec, vs, carb],
    optimizer=tf.train.FtrlOptimizer(
        learning_rate=0.01,
        l1_regularization_strength=1.0,
        l2_regularization_strength=1.0),
    model_dir=model_dir)

m.fit(input_fn=train_input_fn, steps=200)
# Results were not bad for a very small data set, but the recall is suspect
# In reality, these numbers don't mean a thing with such small data
results = m.evaluate(input_fn=eval_input_fn, steps=1)
for key in sorted(results):
    print("%s: %s" % (key, results[key]))