Check out the following source code:
import pandas as pd
import tensorflow as tf  # needed for tf.string / tf.float32 below
from tensorflow.keras import layers, models
colors_df = pd.DataFrame(data=[[5,'yellow'],[1,'red'],[2,'blue'],[3,'green'],[4,'blue'],[7,'purple']], columns=['id', 'color'])
categorical_input = layers.Input(shape=(1,), dtype=tf.string)
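# OneHotEncodingLayer is not defined in this snippet; presumably it is the custom layer from the article
one_hot_layer = OneHotEncodingLayer()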
one_hot_layer.adapt(colors_df['color'].values)
encoded = one_hot_layer(categorical_input)
numeric_input = layers.Input(shape=(1,), dtype=tf.float32)
concat = layers.concatenate([numeric_input, encoded])
model = models.Model(inputs=[numeric_input, categorical_input], outputs=[concat])
predicted = model.predict([colors_df['id'], colors_df['color']])
print(predicted)
# [[5. 0. 1. 0. 0. 0.]
# [1. 0. 0. 1. 0. 0.]
# [2. 1. 0. 0. 0. 0.]
# [3. 0. 0. 0. 0. 1.]
# [4. 1. 0. 0. 0. 0.]
# [7. 0. 0. 0. 1. 0.]]
In the article, the author wrote:
This simple network just accepts a categorical input, One Hot Encodes it, then concatenates the One Hot Encoded features with the numeric input feature. Notice I’ve added a numeric id column to the DataFrame to illustrate how to split categorical inputs from numeric inputs.
I haven't understood this.
Why was an id column supplied along with those 5-digit one-hot codes?
What was its use in the overall application?
The blog post added the id column simply to keep a visible link between each input string and its one-hot encoded output, so that readers can track which input string was converted to which one-hot row.
The ids are fed in as a second input and passed to the output without any processing. For example, yellow, whose id is 5, is converted to [0. 1. 0. 0. 0.], so its output row is [5. 0. 1. 0. 0. 0.].
The id has no other effect on the model (on its performance, for example); it is there purely for demonstration purposes.
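To make the pass-through concrete, here is a minimal plain-Python sketch of what the concatenation amounts to. It is not the blog's OneHotEncodingLayer: the `one_hot` helper and the alphabetical column order are assumptions made only for illustration, so the position of the 1 in each row differs from the output printed in the question.

```python
# Plain-Python illustration; NOT the blog's OneHotEncodingLayer.
rows = [(5, "yellow"), (1, "red"), (2, "blue"),
        (3, "green"), (4, "blue"), (7, "purple")]

# Assumption: one-hot columns are ordered alphabetically.
# The layer in the article builds its own vocabulary order.
vocab = sorted({color for _, color in rows})
index = {color: i for i, color in enumerate(vocab)}


def one_hot(color):
    """Hypothetical helper: vector with a single 1.0 at the color's index."""
    vec = [0.0] * len(vocab)
    vec[index[color]] = 1.0
    return vec


# Mimic the concatenation step: the numeric id goes in front, unchanged,
# followed by the one-hot vector of the color in the same row.
combined = [[float(id_)] + one_hot(color) for id_, color in rows]

for row in combined:
    print(row)
```

The first element of every printed row is the original id, untouched, while the remaining five elements identify the color. The two blue rows (ids 2 and 4) share the same one-hot part and differ only in the id, which is what lets you match each output row back to its input.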