Tags: python, tensorflow, tensorflow2.0, one-hot-encoding

One-hot encoding using tf.data mixes up columns


Minimal working example

Consider the following CSV file (example.csv):

animal,size,weight,category
lion,large,200,mammal
ostrich,large,150,bird
sparrow,small,0.1,bird
whale,large,3000,mammal
bat,small,0.2,mammal
snake,small,1,reptile
condor,medium,12,bird

The goal is to convert all the categorical values into one-hot encodings. The standard way to do this in TensorFlow 2.0 is to use tf.data together with feature columns. Following that approach, the code to handle the dataset above is:

import collections
import tensorflow as tf

# Load the dataset.
dataset = tf.data.experimental.make_csv_dataset(
    'example.csv',
    batch_size=5,
    num_epochs=1,
    shuffle=False)

# Specify the vocabulary for each category.
categories = collections.OrderedDict()
categories['animal'] = ['lion', 'ostrich', 'sparrow', 'whale', 'bat', 'snake', 'condor']
categories['size'] = ['large', 'medium', 'small']
categories['category'] = ['mammal', 'reptile', 'bird']

# Define the categorical feature columns.
categorical_columns = []
for feature, vocab in categories.items():
    cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
        key=feature, vocabulary_list=vocab)
    categorical_columns.append(tf.feature_column.indicator_column(cat_col))

# Retrieve the first batch and apply the one-hot encoding to it.
iterator = iter(dataset)
first_batch = next(iterator)
categorical_layer = tf.keras.layers.DenseFeatures(categorical_columns)

print(categorical_layer(first_batch).numpy())

Question

Running the code above, one gets

[[1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1.]
 [0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1.]]

Here it looks like the last two columns, size and category, have been swapped, even though categories is an ordered dictionary and its order matches the column order in the dataset itself. It's as if tf.feature_column.categorical_column_with_vocabulary_list() did some unwarranted alphabetical sorting of the columns.

What is the reason for this? And is this really the best way to do one-hot encoding in the spirit of tf.data?


Solution

  • Where is the sorting?

    The sorting isn't occurring in tf.feature_column.categorical_column_with_vocabulary_list(). If you print categorical_columns, you will see that the columns are still in the order in which you appended them to the list:

    [
      IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='animal', vocabulary_list=('lion', 'ostrich', 'sparrow', 'whale', 'bat', 'snake', 'condor'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
      IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='size', vocabulary_list=('large', 'medium', 'small'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
      IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='category', vocabulary_list=('mammal', 'reptile', 'bird'), dtype=tf.string, default_value=-1, num_oov_buckets=0))
    ]
    

    The sorting occurs in the tf.keras.layers.DenseFeatures object.

    In the TensorFlow source, the sorting happens in the _normalize_feature_columns function. I found this by tracing the class inheritance from the tf.keras.layers.DenseFeatures class to the tensorflow.python.feature_column.dense_features.DenseFeatures class, then to the tensorflow.python.feature_column.feature_column_v2._BaseFeaturesLayer class, which calls _normalize_feature_columns.
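
To make the effect of that sort concrete, here is a minimal pure-Python sketch. It does not call TensorFlow. It assumes that _normalize_feature_columns orders the columns by their name and that an indicator column's name is its key followed by "_indicator"; both are assumptions about the internals, not something the output above proves on its own.

```python
# Keys in the order they were added to the OrderedDict in the question.
keys_in_insertion_order = ['animal', 'size', 'category']

# Assumption: each indicator column is named "<key>_indicator".
column_names = [key + '_indicator' for key in keys_in_insertion_order]

# Assumption: the layer sorts its columns by name before concatenating them.
sorted_names = sorted(column_names)

print(sorted_names)
# ['animal_indicator', 'category_indicator', 'size_indicator']
```

Under these assumptions, category lands before size, which matches the layout of the printed matrix.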

  • Why is it sorted?

    The same file that contains the _normalize_feature_columns function (the function where the columns are sorted) has a similar sorting step elsewhere, with this comment:

    # Sort the columns so the default collection name is deterministic even if the
    # user passes columns from an unsorted collection, such as dict.values().
    

    I think this explanation also applies to why the columns are sorted when using the tf.keras.layers.DenseFeatures class. Your columns and data are consistent, but TensorFlow doesn't assume that the input will be, so it sorts the columns to guarantee a deterministic order.
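
If you need the blocks back in your own order, one option is to compute where each feature ends up and rearrange the encoded rows afterwards. The sketch below is pure Python and only illustrates the bookkeeping; output_layout and restore_order are hypothetical helpers (not part of TensorFlow), and the sort key again assumes the columns are ordered by a name of the form "<key>_indicator".

```python
import collections

# Vocabularies in the order the question adds them.
categories = collections.OrderedDict()
categories['animal'] = ['lion', 'ostrich', 'sparrow', 'whale', 'bat', 'snake', 'condor']
categories['size'] = ['large', 'medium', 'small']
categories['category'] = ['mammal', 'reptile', 'bird']


def output_layout(categories):
    """Return {key: (start, stop)} for each feature in the encoded output.

    Assumes the layer concatenates indicator columns in ascending order of
    their names, taken here to be "<key>_indicator".
    """
    layout = {}
    offset = 0
    for key in sorted(categories, key=lambda k: k + '_indicator'):
        width = len(categories[key])
        layout[key] = (offset, offset + width)
        offset += width
    return layout


def restore_order(row, categories):
    """Rearrange one encoded row so its blocks follow the dict's own order."""
    layout = output_layout(categories)
    reordered = []
    for key in categories:
        start, stop = layout[key]
        reordered.extend(row[start:stop])
    return reordered


layout = output_layout(categories)
print(layout)
# {'animal': (0, 7), 'category': (7, 10), 'size': (10, 13)}

# Second row of the output in the question (ostrich, large, bird).
row = [0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0.]
print(restore_order(row, categories))
# [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
```

After reordering, the row reads animal, then size (large), then category (bird), which is the order the question expected.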