Tags: python, tensorflow, keras, lstm, embedding

Dimensions between embedding layer and LSTM encoder layer don't match


I am trying to build an encoder-decoder model for text generation, using LSTM layers with an embedding layer. However, I have a problem feeding the output of the embedding layer into the LSTM encoder layer. The error I get is:

 ValueError: Input 0 of layer lstm is incompatible with the layer: expected ndim=3, found ndim=4. Full shape received: (None, 13, 128, 512)

My encoder data has shape (40, 13, 128) = (num_observations, max_encoder_seq_length, vocab_size), and the embedding size / latent_dim is 512.

My questions are: how can I get rid of this 4th dimension coming out of the embedding layer before it reaches the LSTM encoder layer, or in other words, how should I pass these dimensions to the encoder's LSTM layer? And, since I am new to this topic, is there anything I should also correct in the decoder's LSTM layer?
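
To illustrate where the extra dimension comes from, here is a minimal repro (with dummy data shaped like mine) of what happens when a one-hot encoded batch goes through the embedding layer:

import numpy as np
import tensorflow as tf

# Dummy one-hot batch with the same shape as my encoder data: (batch, 13, 128)
onehot_batch = np.zeros((2, 13, 128), dtype="float32")

# Embedding treats every value along the last axis as a token id, so the
# output gains an extra dimension: (2, 13, 128, 512)
emb = tf.keras.layers.Embedding(input_dim=128, output_dim=512)
print(emb(onehot_batch).shape)  # ndim=4, but the LSTM expects ndim=3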

I have read several posts, including this one, this one, and many others, but couldn't find a solution. It seems to me that the problem is not in the model but in the shape of the data. Any hint or remark about what could be wrong would be more than appreciated. Thank you very much.

My model, taken from this tutorial, is the following:

encoder_inputs = Input(shape=(max_encoder_seq_length,))
x = Embedding(num_encoder_tokens, latent_dim)(encoder_inputs)
x, state_h, state_c = LSTM(latent_dim, return_state=True)(x)
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(max_decoder_seq_length,))
x = Embedding(num_decoder_tokens, latent_dim)(decoder_inputs)
x = LSTM(latent_dim, return_sequences=True)(x, initial_state=encoder_states)
decoder_outputs = Dense(num_decoder_tokens, activation='softmax')(x)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.summary()

# Compile & run training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
# Note that `decoder_target_data` needs to be one-hot encoded,
# rather than sequences of integers like `decoder_input_data`!
model.fit([encoder_input_data, decoder_input_data],
          decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          shuffle=True,
          validation_split=0.05)

The summary of my model is:

Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            [(None, 13)]         0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, 15)]         0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 13, 512)      65536       input_1[0][0]                    
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 15, 512)      65536       input_2[0][0]                    
__________________________________________________________________________________________________
lstm (LSTM)                     [(None, 512), (None, 2099200     embedding[0][0]                  
__________________________________________________________________________________________________
lstm_1 (LSTM)                   (None, 15, 512)      2099200     embedding_1[0][0]                
                                                                 lstm[0][1]                       
                                                                 lstm[0][2]                       
__________________________________________________________________________________________________
dense (Dense)                   (None, 15, 128)      65664       lstm_1[0][0]                     
==================================================================================================
Total params: 4,395,136
Trainable params: 4,395,136
Non-trainable params: 0
__________________________________________________________________________________________________

Edit

I am formatting my data in the following way:

for i, text in enumerate(input_texts):
    words = text.split()  # text is a sentence
    for t, word in enumerate(words):
        encoder_input_data[i, t, input_dict[word]] = 1.

This gives, for decoder_input_data[:2], the following:

array([[[0., 1., 0., ..., 0., 0., 0.],
        [0., 0., 1., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],
       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 1., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]], dtype=float32)
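
Based on the model summary, I suspect the inputs should instead be 2D integer sequences of shape (num_observations, max_encoder_seq_length) rather than these one-hot arrays; roughly something like the sketch below (reusing input_texts and input_dict from above), but I am not sure whether that is the right way to prepare the data:

import numpy as np

# Label-encoded sequences: one integer token id per word, shape (40, 13)
encoder_input_data = np.zeros((num_observations, max_encoder_seq_length), dtype="int32")
for i, text in enumerate(input_texts):
    for t, word in enumerate(text.split()):
        encoder_input_data[i, t] = input_dict[word]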

Solution

  • I am not sure what you are passing to the model as inputs and outputs, but the following works. Please note the shapes of the encoder and decoder inputs I am passing; your inputs need to have those shapes for the model to run.

    import numpy as np
    from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
    from tensorflow.keras.models import Model

    ### INITIAL CONFIGURATION
    num_observations = 40
    max_encoder_seq_length = 13
    max_decoder_seq_length = 15
    num_encoder_tokens = 128
    num_decoder_tokens = 128
    latent_dim = 512
    batch_size = 256
    epochs = 5
    
    ### MODEL DEFINITION
    encoder_inputs = Input(shape=(max_encoder_seq_length,))
    x = Embedding(num_encoder_tokens, latent_dim)(encoder_inputs)
    x, state_h, state_c = LSTM(latent_dim, return_state=True)(x)
    encoder_states = [state_h, state_c]
    
    # Set up the decoder, using `encoder_states` as initial state.
    decoder_inputs = Input(shape=(max_decoder_seq_length,))
    x = Embedding(num_decoder_tokens, latent_dim)(decoder_inputs)
    x = LSTM(latent_dim, return_sequences=True)(x, initial_state=encoder_states)
    decoder_outputs = Dense(num_decoder_tokens, activation='softmax')(x)
    
    # Define the model that will turn
    # `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
    
    model.summary()
    
    model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
    
    
    ### MODEL INPUT AND OUTPUT SHAPES
    # Dummy arrays with the required shapes; in a real run the encoder and
    # decoder inputs would be integer (label-encoded) token sequences.
    encoder_input_data = np.random.random((1000,13))
    decoder_input_data = np.random.random((1000,15))
    decoder_target_data = np.random.random((1000, 15, 128))
    
    model.fit([encoder_input_data, decoder_input_data],
              decoder_target_data,
              batch_size=batch_size,
              epochs=epochs,
              shuffle=True,
              validation_split=0.05)
    
    Model: "functional_210"
    __________________________________________________________________________________________________
    Layer (type)                    Output Shape         Param #     Connected to                     
    ==================================================================================================
    input_176 (InputLayer)          [(None, 13)]         0                                            
    __________________________________________________________________________________________________
    input_177 (InputLayer)          [(None, 15)]         0                                            
    __________________________________________________________________________________________________
    embedding_33 (Embedding)        (None, 13, 512)      65536       input_176[0][0]                  
    __________________________________________________________________________________________________
    embedding_34 (Embedding)        (None, 15, 512)      65536       input_177[0][0]                  
    __________________________________________________________________________________________________
    lstm_94 (LSTM)                  [(None, 512), (None, 2099200     embedding_33[0][0]               
    __________________________________________________________________________________________________
    lstm_95 (LSTM)                  (None, 15, 512)      2099200     embedding_34[0][0]               
                                                                     lstm_94[0][1]                    
                                                                     lstm_94[0][2]                    
    __________________________________________________________________________________________________
    dense_95 (Dense)                (None, 15, 128)      65664       lstm_95[0][0]                    
    ==================================================================================================
    Total params: 4,395,136
    Trainable params: 4,395,136
    Non-trainable params: 0
    __________________________________________________________________________________________________
    
    Epoch 1/5
    4/4 [==============================] - 3s 853ms/step - loss: 310.7389 - val_loss: 310.3570
    Epoch 2/5
    4/4 [==============================] - 3s 638ms/step - loss: 310.6186 - val_loss: 310.3362
    Epoch 3/5
    4/4 [==============================] - 3s 852ms/step - loss: 310.6126 - val_loss: 310.3345
    Epoch 4/5
    4/4 [==============================] - 3s 797ms/step - loss: 310.6111 - val_loss: 310.3369
    Epoch 5/5
    4/4 [==============================] - 3s 872ms/step - loss: 310.6117 - val_loss: 310.3352
    

    The sequence data (text) needs to be passed to the inputs as label-encoded sequences. This can be done with something like the TextVectorization layer from Keras. Please read more about how to prepare text data for embedding layers and LSTMs here.
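
    As a rough sketch of that preprocessing step (assuming TensorFlow 2.x, where the tf.keras.layers.TextVectorization layer is available; the sentences and parameter values here are only illustrative):

    import tensorflow as tf

    # Illustrative sentences; in practice these would be the input_texts
    texts = tf.constant(["go away", "how are you", "see you soon"])

    vectorizer = tf.keras.layers.TextVectorization(
        max_tokens=128,               # ~ num_encoder_tokens / vocab_size
        output_mode="int",            # label-encoded token ids
        output_sequence_length=13)    # ~ max_encoder_seq_length (pads / truncates)
    vectorizer.adapt(texts)

    encoder_input_data = vectorizer(texts).numpy()
    print(encoder_input_data.shape)   # (3, 13) -> (num_observations, max_encoder_seq_length)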