Tags: keras, lstm, padding, tf.keras, word-embedding

Pad vectors in tf.keras for LSTM


Keras has a preprocessing utility to pad sequences, but it assumes that the sequences are made up of integers.
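
For reference, a minimal sketch of that utility applied to integer token sequences (the token IDs below are made up for illustration):

    from tensorflow.keras.preprocessing.sequence import pad_sequences

    # ragged lists of integer token IDs, which is what pad_sequences expects
    seqs = [[3, 7], [12, 4, 9, 2], [5]]
    padded = pad_sequences(seqs, maxlen=4, padding='post', value=0)
    print(padded.shape)  # (3, 4): shorter sequences are post-padded with 0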

My sequences are vectors (my own embeddings; I do not want to use Keras embeddings). Is there any way I can pad them for use in an LSTM?

Sequences can be made equal in length with plain Python, but the padding utilities in Keras also provide metadata that layers like LSTM use for masking.


Solution

  • This is one way to pad arrays of floats of different lengths with zeros.

    To mask the zeros you can use a Masking layer (otherwise remove it).

    I store your embeddings in a list because NumPy can't handle arrays of different lengths. In the example, I use 4 samples of different lengths; the corresponding embeddings are stored in a list of arrays with shapes (1, 300), (2, 300), (3, 300) and (4, 300).

    import numpy as np
    import tensorflow as tf
    
    # recreate your embeddings: 4 samples with 1, 2, 3 and 4 time steps, each of dimension 300
    emb = []
    for i in range(1, 5):
        emb.append(np.random.uniform(0, 1, (i, 300)))
    
    # custom padding function: zero-pad a (length, 300) array to (max_len, 300)
    def pad(x, max_len):
        new_x = np.zeros((max_len, x.shape[-1]))
        new_x[:len(x), :] = x  # post padding: real values first, zeros after
        return new_x
    
    # pad all embeddings to max_len=100 and stack into a single (4, 100, 300) array
    emb = np.stack(list(map(lambda x: pad(x, max_len=100), emb)))
    
    # Masking tells the downstream LSTM to skip time steps that are entirely zero
    emb_model = tf.keras.Sequential()
    emb_model.add(tf.keras.layers.Masking(mask_value=0., input_shape=(100, 300)))
    emb_model.add(tf.keras.layers.LSTM(32))
    
    emb_model(emb)  # forward pass on the padded batch; output shape (4, 32)
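
    To check that the zero-padded time steps are actually masked, you can inspect the mask computed by the Masking layer directly (a quick sanity check, assuming the snippet above has already been run):

    mask = emb_model.layers[0].compute_mask(tf.constant(emb, dtype=tf.float32))
    print(mask.shape)                # (4, 100): one boolean per time step and sample
    print(mask.numpy().sum(axis=1))  # [1 2 3 4]: only the real (non-padded) steps are unmasked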