Tags: python, keras, keras-layer, one-hot-encoding

Lambda layer in Keras with keras.backend.one_hot gives TypeError


I'm trying to train a character-level CNN using Keras, taking a single word as input. I have already transformed the words into zero-padded lists of character indices, but when I try to feed them into one_hot, I get a TypeError.

>>> X_train[0]
array([31, 14, 23, 29, 27, 18, 12, 30, 21, 10, 27,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0], dtype=uint8)
>>> X_train.shape
(2226641, 98)
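For reference, this is roughly how I build those arrays (a sketch; char_to_idx is a stand-in for my real character-to-index mapping, with 0 reserved for padding):

import numpy as np
from keras.preprocessing.sequence import pad_sequences

# Stand-in character vocabulary; the real mapping covers every character
# in the corpus, with index 0 reserved for padding.
char_to_idx = {c: i + 1 for i, c in enumerate('abcdefghijklmnopqrstuvwxyz')}

def encode(word, maxlen=98):
    # Map characters to indices, then zero-pad on the right to a fixed
    # length, matching the trailing zeros shown above.
    idx = [char_to_idx.get(c, 0) for c in word]
    return pad_sequences([idx], maxlen=maxlen, padding='post')[0].astype(np.uint8)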

But when I try to create my model like this:

from keras import backend as K
from keras.models import Sequential
from keras.layers import Lambda, Conv1D

k_model = Sequential()
k_model.add(Lambda(K.one_hot, arguments={'num_classes': 100},
                   input_shape=(98,), output_shape=(98, 100)))
k_model.add(Conv1D(filters=16, kernel_size=5, strides=1, padding='valid'))

I get TypeError: Value passed to parameter 'indices' has DataType float32 not in list of allowed values: uint8, int32, int64.

Execution is obviously not even reaching the point where X_train is read, so where is the float value coming from?

I would like to have an instance shape of (98, 100), where 100 is the number of classes.

I can't fit the entire dataset in memory.
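For what it's worth, my best guess is that Keras creates the model's input placeholder as float32 by default, so K.one_hot sees a float tensor at graph-construction time, before X_train is ever fed. A sketch of the cast workaround I could apply inside the Lambda, reusing the shapes above:

from keras import backend as K
from keras.models import Sequential
from keras.layers import Lambda, Conv1D

k_model = Sequential()
# Cast the default float32 placeholder back to integers so that K.one_hot
# receives one of the allowed index dtypes (uint8, int32, int64).
k_model.add(Lambda(lambda x: K.one_hot(K.cast(x, 'int32'), 100),
                   input_shape=(98,), output_shape=(98, 100)))
k_model.add(Conv1D(filters=16, kernel_size=5, strides=1, padding='valid'))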


Solution

  • I would suggest a cleaner solution that achieves the same result. How about:

    num_classes = 100  # size of the index vocabulary, as in the question

    k_model.add(Embedding(num_classes, num_classes,
                          embeddings_initializer='identity',
                          trainable=False,
                          name='onehot'))
    

    You are essentially embedding the indices anyway, so it makes more sense to use an Embedding layer with fixed identity weights, which maps index i to the i-th one-hot vector. It also gives you the flexibility to make the embedding trainable in the future. A full sketch of the model follows below.
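    A minimal end-to-end sketch of that approach, assuming num_classes = 100 and the padded length of 98 from the question (the Conv1D settings are copied from the original model):

    from keras.models import Sequential
    from keras.layers import Embedding, Conv1D

    num_classes = 100

    k_model = Sequential()
    # Fixed identity embedding: index i is looked up as the one-hot row i,
    # so each sample comes out with shape (98, num_classes).
    k_model.add(Embedding(num_classes, num_classes,
                          embeddings_initializer='identity',
                          trainable=False,
                          name='onehot',
                          input_length=98))
    k_model.add(Conv1D(filters=16, kernel_size=5, strides=1, padding='valid'))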