Tags: keras, lstm, padding, tf.keras, word-embedding

Pad vectors in tf.keras for LSTM


Keras has a preprocessing utility to pad sequences, but it assumes that the sequences are made up of integers.
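
For reference, a minimal sketch of that utility applied to integer token sequences (the token IDs below are made up for illustration):

    from tensorflow.keras.preprocessing.sequence import pad_sequences

    # ragged lists of integer token IDs, which is what pad_sequences expects
    seqs = [[3, 7], [12, 4, 9, 2], [5]]
    padded = pad_sequences(seqs, maxlen=4, padding='post', value=0)
    print(padded.shape)  # (3, 4): shorter sequences are post-padded with 0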

My sequences are vectors (my own embeddings; I do not want to use Keras embeddings). Is there any way I can pad them for use in an LSTM?

Sequences can be made equal in length with plain Python, but the padding utilities in Keras also provide metadata that layers like LSTM use for masking.


Solution

  • This is one way to pad arrays of floats of different lengths with zeros.

    To mask the zeros you can use a Masking layer (otherwise remove it).

    I store your embeddings in a list because NumPy can't handle arrays of different lengths. In the example, I use 4 samples of different lengths; the corresponding embeddings are stored in a list of arrays with shapes (1, 300), (2, 300), (3, 300) and (4, 300).

    import numpy as np
    import tensorflow as tf
    
    # recreate your embeddings: 4 samples with 1, 2, 3 and 4 time steps, each of dimension 300
    emb = []
    for i in range(1, 5):
        emb.append(np.random.uniform(0, 1, (i, 300)))
    
    # custom padding function: zero-pad a (length, 300) array to (max_len, 300)
    def pad(x, max_len):
        new_x = np.zeros((max_len, x.shape[-1]))
        new_x[:len(x), :] = x  # post padding: real values first, zeros after
        return new_x
    
    # pad all embeddings to max_len=100 and stack into a single (4, 100, 300) array
    emb = np.stack(list(map(lambda x: pad(x, max_len=100), emb)))
    
    # Masking tells the downstream LSTM to skip time steps that are entirely zero
    emb_model = tf.keras.Sequential()
    emb_model.add(tf.keras.layers.Masking(mask_value=0., input_shape=(100, 300)))
    emb_model.add(tf.keras.layers.LSTM(32))
    
    emb_model(emb)  # forward pass on the padded batch; output shape (4, 32)
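
    To check that the zero-padded time steps are actually masked, you can inspect the mask computed by the Masking layer directly (a quick sanity check, assuming the snippet above has already been run):

    mask = emb_model.layers[0].compute_mask(tf.constant(emb, dtype=tf.float32))
    print(mask.shape)                # (4, 100): one boolean per time step and sample
    print(mask.numpy().sum(axis=1))  # [1 2 3 4]: only the real (non-padded) steps are unmasked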