Using Tensorflow 1.8.0, we are running into an issue whenever we attempt to build a categorical column. Here is a full example demonstrating the problem. It runs as-is (using only numeric columns). Uncommenting the indicator column definition and data generates a stack trace ending in tensorflow.python.framework.errors_impl.InternalError: Unable to get element as bytes.
import tensorflow as tf
import numpy as np

def feature_numeric(key):
    return tf.feature_column.numeric_column(key=key, default_value=0)

def feature_indicator(key, vocabulary):
    return tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            key=key, vocabulary_list=vocabulary))

labels = ['Label1', 'Label2', 'Label3']

model = tf.estimator.DNNClassifier(
    feature_columns=[
        feature_numeric("number"),
        # feature_indicator("indicator", ["A", "B", "C"]),
    ],
    hidden_units=[64, 16, 8],
    model_dir='./models',
    n_classes=len(labels),
    label_vocabulary=labels)

def train(inputs, training):
    model.train(
        input_fn=tf.estimator.inputs.numpy_input_fn(
            x=inputs,
            y=training,
            shuffle=True
        ), steps=1)

inputs = {
    "number": np.array([1, 2, 3, 4, 5]),
    # "indicator": np.array([
    #     ["A"],
    #     ["B"],
    #     ["C"],
    #     ["A", "A"],
    #     ["A", "B", "C"],
    # ]),
}

training = np.array(['Label1', 'Label2', 'Label3', 'Label2', 'Label1'])

train(inputs, training)
Attempts to use an embedding column fare no better. Using only numeric inputs, we can successfully scale to thousands of input nodes, and in fact we have temporarily expanded our categorical features in the preprocessor to simulate indicators.
The documentation for categorical_column_*() and indicator_column() is awash in references to features we're pretty sure we're not using (proto inputs, whatever bytes_list is), but maybe we're wrong about that?
The issue here is the ragged shape of the "indicator" input array (some elements have length 1, one has length 2, one has length 3). np.array() cannot build a rectangular string array from rows of different lengths, so it falls back to a 1-D object array of Python lists, which the feature column machinery cannot convert to string tensors, hence the "Unable to get element as bytes" error. If you pad your input lists with some non-vocabulary string (I used "Z", for example, since your vocabulary is "A", "B", "C"), you'll get the expected results:
inputs = {
    "number": np.array([1, 2, 3, 4, 5]),
    "indicator": np.array([
        ["A", "Z", "Z"],
        ["B", "Z", "Z"],
        ["C", "Z", "Z"],
        ["A", "A", "Z"],
        ["A", "B", "C"],
    ]),
}
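If you'd rather not pad by hand, a small helper can do it. This is just an illustrative sketch (the name pad_ragged and the "Z" filler are my own choices, not part of any TF API); it pads each row to the length of the longest row with a token absent from the vocabulary:

```python
import numpy as np

def pad_ragged(rows, fill="Z"):
    """Pad each ragged row with `fill` so np.array yields a rectangular 2-D array."""
    width = max(len(row) for row in rows)
    return np.array([list(row) + [fill] * (width - len(row)) for row in rows])

padded = pad_ragged([["A"], ["B"], ["C"], ["A", "A"], ["A", "B", "C"]])
print(padded.shape)  # (5, 3)
```

Any filler works as long as it never appears in the vocabulary list, since out-of-vocabulary terms map to no indicator slot.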
You can verify that this works by printing the resulting tensor:
dense = tf.feature_column.input_layer(
    inputs,
    [
        feature_numeric("number"),
        feature_indicator("indicator", ["A", "B", "C"]),
    ])

with tf.train.MonitoredTrainingSession() as sess:
    print(dense)
    print(sess.run(dense))
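To see why the padding token is harmless, note that an indicator column produces a multi-hot vector per row, counting how often each vocabulary term appears; out-of-vocabulary terms like "Z" contribute nothing. A plain-Python sketch of that encoding (not the TF implementation, just an illustration):

```python
def multi_hot(row, vocabulary):
    """Count occurrences of each vocabulary term in a row of strings."""
    return [row.count(term) for term in vocabulary]

vocab = ["A", "B", "C"]
print(multi_hot(["A", "Z", "Z"], vocab))  # [1, 0, 0]
print(multi_hot(["A", "A", "Z"], vocab))  # [2, 0, 0]
print(multi_hot(["A", "B", "C"], vocab))  # [1, 1, 1]
```

Note the ["A", "A", "Z"] row: duplicated terms are counted, so the padded input preserves exactly the information in the original ragged lists.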