python tensorflow keras neural-network one-hot-encoding

What is the use of the ID field in the source code?


Check out the following source code:

import pandas as pd
import tensorflow as tf  # needed for tf.string and tf.float32 below
from tensorflow.keras import layers, models

colors_df = pd.DataFrame(data=[[5,'yellow'],[1,'red'],[2,'blue'],[3,'green'],[4,'blue'],[7,'purple']], columns=['id', 'color'])

categorical_input = layers.Input(shape=(1,), dtype=tf.string)
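# OneHotEncodingLayer is not a built-in Keras layer; it is presumably the
# custom layer defined in the article and is not reproduced in this snippet.
one_hot_layer = OneHotEncodingLayer()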
one_hot_layer.adapt(colors_df['color'].values)
encoded = one_hot_layer(categorical_input)

numeric_input = layers.Input(shape=(1,), dtype=tf.float32)

concat = layers.concatenate([numeric_input, encoded])

model = models.Model(inputs=[numeric_input, categorical_input], outputs=[concat])
predicted = model.predict([colors_df['id'], colors_df['color']])
print(predicted)
# [[5. 0. 1. 0. 0. 0.]
#  [1. 0. 0. 1. 0. 0.]
#  [2. 1. 0. 0. 0. 0.]
#  [3. 0. 0. 0. 0. 1.]
#  [4. 1. 0. 0. 0. 0.]
#  [7. 0. 0. 0. 1. 0.]]

In the article this code comes from, the author wrote:

This simple network just accepts a categorical input, One Hot Encodes it, then concatenates the One Hot Encoded features with the numeric input feature. Notice I’ve added a numeric id column to the DataFrame to illustrate how to split categorical inputs from numeric inputs.

I haven't understood this.

Why was an id column supplied along with those 5-digit one-hot codes?

What was its use in the overall application?


Solution

  • The blog post added the id column only to keep a visible link between each input string and its one-hot encoded output, so that readers can track which input string was converted to which one-hot row.

    The ids are passed in as a second input and concatenated to the output without any processing. This lets you see, for example, that yellow, whose id is 5, was converted to [0. 1. 0. 0. 0.].

    It has no other effect on the model (for instance, on its performance); it is there purely for demonstration purposes.
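
To make the pass-through concrete, here is a minimal plain-Python sketch of what the concatenation does to each row. It is not the article's Keras code: the `vocab` order is an assumption read off the printed output in the question, and `one_hot` and `concat_row` are hypothetical helper names used only for illustration.

```python
# Column order inferred from the printed output in the question
# (an assumption about the layer's vocabulary, not a guarantee).
vocab = ["blue", "yellow", "red", "purple", "green"]

# The same (id, color) pairs as colors_df in the question.
rows = [(5, "yellow"), (1, "red"), (2, "blue"),
        (3, "green"), (4, "blue"), (7, "purple")]


def one_hot(color):
    """Return a list with 1.0 at the color's vocabulary index, 0.0 elsewhere."""
    return [1.0 if color == v else 0.0 for v in vocab]


def concat_row(row_id, color):
    """Imitate the concatenation: the id is copied through unchanged,
    placed in front of the one-hot vector for the color."""
    return [float(row_id)] + one_hot(color)


result = [concat_row(row_id, color) for row_id, color in rows]
for r in result:
    print(r)
```

The first column of every row is just the original id, so each one-hot row can be matched back to the color it came from; nothing is learned from it or computed with it.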