I'm seeing that my keras model does not handle input columns well if they are not float values. I'd like to be able to train the model using columns that contain "labels", and by labels I mean IDs of sorts, or encoded string names. Ideally the model would integrate these label columns, learning which values within these categorical columns are predictive of the outcome.
For example, I'm trying to predict the outcomes of a competition (Win=1, Loss=0) and I'd like to include "team name" and "coach name" in the historical data. Ideally the model would identify which teams and coaches are more likely to win.
However, when I run model.fit and the training set includes anything other than int/float values (values that are statistical in nature, not categorical), it reports the same accuracy for every epoch with a very high loss.
Here is how I defined my model:
model = keras.Sequential([
    keras.layers.Dense(1024, activation=tf.nn.relu, kernel_initializer=init_ru, bias_initializer=init_ru),
    keras.layers.Dense(512, activation=tf.nn.relu, kernel_initializer=init_ru, bias_initializer=init_ru),
    keras.layers.Dense(256, activation=tf.nn.relu),
    keras.layers.Dense(128, activation=tf.nn.relu),
    keras.layers.Dense(32, activation=tf.nn.relu),
    keras.layers.Dense(1, activation=tf.nn.sigmoid)
])

opt = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=True)
model.compile(optimizer=opt,
              loss='binary_crossentropy',
              metrics=['accuracy'])
It works great if I don't include any categorical data, but I think that if I could get it to work with categorical data, accuracy would improve even more.
The standard way to handle categorical data is to build a vocabulary of the valid values and then convert each category into a one-hot vector, i.e. a vector with one element per possible value, where only the element for the observed value is 1.
This is a reasonable introductory article with examples: https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/
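As a minimal sketch of what that looks like for your case: `pandas.get_dummies` will expand each categorical column into one 0/1 column per distinct value, which the Dense layers can then consume as floats. The column names and data here are hypothetical stand-ins for your historical data.

```python
import pandas as pd

# Hypothetical historical data: two categorical columns plus a numeric one.
df = pd.DataFrame({
    "team_name": ["Hawks", "Lions", "Hawks"],
    "coach_name": ["Smith", "Jones", "Smith"],
    "points_scored": [88, 75, 92],
    "win": [1, 0, 1],
})

# One-hot encode the categorical columns; numeric columns pass through unchanged.
encoded = pd.get_dummies(df, columns=["team_name", "coach_name"])

# encoded now has columns like team_name_Hawks, team_name_Lions,
# coach_name_Smith, coach_name_Jones, each holding 0/1 values.
print(sorted(encoded.columns))
```

Note that the vocabulary is fixed at encoding time, so any team or coach that appears only in your test set needs to be handled (e.g. by reindexing the test frame to the training columns) before calling `model.fit` or `model.predict`.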