Tags: python, tensorflow, keras, deep-learning, lstm

Confusion about LSTM input shape: expected shape=(None, None, 1100), found shape=(1, 29907, 1) (the data is genome sequence)


My data consists of genome sequences, basically long strings like "AAATTGCCAA...AA". There are 1100 rows in total, and each row has length 29907. Here is a picture of my DataFrame. There are also 5 target values.

I converted the data into a NumPy array of float values. The shape of my data is train_data.shape: (1100, 29907). This is how my data looks after converting it to a NumPy array:

array([[1.  , 0.25, 0.25, ..., 0.  , 0.  , 0.  ],
       [0.75, 0.5 , 0.5 , ..., 0.  , 0.  , 0.  ],
       [1.  , 0.75, 1.  , ..., 0.  , 0.  , 0.  ],
       ...,
       [0.5 , 0.25, 0.75, ..., 0.  , 0.  , 0.  ],
       [0.5 , 0.25, 0.75, ..., 0.  , 0.  , 0.  ],
       [1.  , 0.75, 1.  , ..., 0.  , 0.  , 0.  ]], dtype=float32)
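The post doesn't say how the letters were mapped to these floats, but a plausible sketch of such an encoding (the `BASE_TO_FLOAT` mapping below is my assumption, not taken from the question) looks like this:

```python
import numpy as np

# Hypothetical mapping: values like 0.25/0.5/0.75/1.0 in the array above
# suggest each base maps to a fixed float, but the exact mapping is assumed.
BASE_TO_FLOAT = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}

def encode(seq, length):
    """Encode one genome string as floats, zero-padding to a fixed length."""
    arr = np.zeros(length, dtype=np.float32)
    for i, base in enumerate(seq[:length]):
        arr[i] = BASE_TO_FLOAT.get(base, 0.0)
    return arr

rows = ["AAATTG", "CCGTA"]            # tiny stand-ins for the 1100 real rows
train_data = np.stack([encode(s, 8) for s in rows])
print(train_data.shape)               # (2, 8)
```

With the real data this would produce the (1100, 29907) float32 array shown above.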

We know that an LSTM requires a 3D input of shape (batch_size, time_steps, features).

So I reshaped the data:

train_data=train_data.reshape(1100,29907,1)

Now, when I build my model, I pass input_shape=(29907,1100).

The actual model is given below. But when I run the model, it gives me this ValueError:
ValueError: Input 0 is incompatible with layer sequential: expected shape=(None, None, 1100), found shape=(1, 29907, 1).
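For clarity (my own shape bookkeeping, not part of the original post), here is the mismatch spelled out in plain Python: the model fixes the feature dimension at 1100, while the reshaped data has only 1 feature per time step.

```python
# input_shape=(29907, 1100) tells the LSTM to expect 1100 features per step;
# Keras prepends the batch axis and leaves the time axis flexible, which is
# why the error reports "expected shape=(None, None, 1100)".
expected_features = 1100
found = (1, 29907, 1)                  # (batch, time_steps, features) actually fed in
assert found[-1] != expected_features  # this mismatch is what raises the ValueError
print(found[-1], "!=", expected_features)
```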

I don't understand the error: is my input shape wrong, or is the reshaping wrong? Also, I have used a Conv1D as the second layer. Does the shape have to be compatible with both layers (LSTM and Conv1D)? In the LSTM layer, I want 5 outputs (units=5), as there are 5 target values.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Conv1D, Dropout, MaxPooling1D, Flatten, Dense
from tensorflow.keras import regularizers
from tensorflow.keras.optimizers import Adam

def create_model():

    num_classes = 5
    model = Sequential([
        LSTM(units=16, input_shape=(train_data.shape[1], train_data.shape[2]),
             activation='relu', return_sequences=True,
             kernel_regularizer=regularizers.l2(0.0001)),
        Dropout(rate=0.5),
        Conv1D(filters=100, kernel_size=21, strides=1,
               padding="same", activation='relu'),
        Dropout(rate=0.3),
        MaxPooling1D(pool_size=148, strides=1, padding='valid'),
        Dropout(rate=0.1),
        # LSTM(16, activation='relu'),
        Flatten(),
        Dense(num_classes, activation='softmax')
    ])

    # compile the model
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer=Adam(learning_rate=0.0001),
                  metrics=['accuracy'])

    return model

Solution

  • 1D convolution on sequences expects a 3D input, whereas an LSTM outputs a 2D array (batch_size, units) when the argument return_sequences is set to False. This argument controls whether the layer returns the output at every time step instead of only the final one. Setting it to True returns a 3D array (batch_size, time_steps, units) instead of a 2D array.

    LSTM(units=5, input_shape=(29907,1100), activation='relu', return_sequences=True, kernel_regularizer=regularizers.l2(0.0001))
    

    Also, just a small thing, but don't hardcode the input shape; instead set it to something like input_shape=(X_train.shape[1], X_train.shape[2]).
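    To see the return_sequences shape difference concretely, here is a minimal NumPy sketch of an LSTM forward pass (random weights, written only to illustrate the output shapes; it is not the Keras implementation):

    ```python
    import numpy as np

    def lstm_forward(x, units, return_sequences=False, seed=0):
        """Toy LSTM forward pass; only the output shapes matter here."""
        rng = np.random.default_rng(seed)
        batch, T, F = x.shape
        W = rng.standard_normal((F, 4 * units)) * 0.01       # input weights
        U = rng.standard_normal((units, 4 * units)) * 0.01   # recurrent weights
        b = np.zeros(4 * units)
        h = np.zeros((batch, units))
        c = np.zeros((batch, units))
        sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
        outputs = []
        for t in range(T):
            z = x[:, t, :] @ W + h @ U + b
            i, f, g, o = np.split(z, 4, axis=1)              # gate pre-activations
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
            h = sigmoid(o) * np.tanh(c)
            outputs.append(h)
        # True -> one output per time step; False -> only the final step
        return np.stack(outputs, axis=1) if return_sequences else h

    x = np.zeros((2, 10, 3))  # (batch, time_steps, features)
    print(lstm_forward(x, units=5, return_sequences=True).shape)   # (2, 10, 5)
    print(lstm_forward(x, units=5, return_sequences=False).shape)  # (2, 5)
    ```

    The 3D output with return_sequences=True is exactly what the following Conv1D layer needs.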

    Edit:

    So, in addition to that, you need to expand the last dimension of the data before passing it to the model; reshape it with np.expand_dims:

    # train_data shape: (1100,29907)
    train_data = np.expand_dims(train_data, -1) # new shape = (1100,29907,1)
    

    Do not reshape it like this; hardcoding the sizes makes it easy to get the number of values in a dimension wrong:

    train_data=train_data.reshape(1100,29907,1)
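    A quick sanity check with a small dummy array (my own illustration) shows that np.expand_dims produces the same result without repeating the sizes:

    ```python
    import numpy as np

    # Dummy stand-in for train_data; only the shape matters here.
    train_data = np.zeros((4, 7), dtype=np.float32)

    # expand_dims adds the channel axis without hardcoding any dimension...
    a = np.expand_dims(train_data, -1)
    # ...while reshape requires typing 4 and 7 correctly every time.
    b = train_data.reshape(4, 7, 1)

    print(a.shape, b.shape, np.array_equal(a, b))  # (4, 7, 1) (4, 7, 1) True
    ```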
    

    Then just set the LSTM's input_shape to the 2nd and 3rd dimensions of the input data.

    num_classes = 5
    model = Sequential([
        LSTM(units=num_classes, input_shape=(train_data.shape[1], train_data.shape[2]),
             activation='relu', return_sequences=True,
             kernel_regularizer=regularizers.l2(0.0001)),
        Conv1D(filters=100, kernel_size=21, strides=1,
               padding="same", activation='relu'),
        Dropout(rate=0.5),
        MaxPooling1D(pool_size=203, strides=1, padding='valid'),
        Flatten(),
        Dense(num_classes, activation='softmax')
    ])
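    As a final sanity check (my own arithmetic, not from the original answer), the per-sample output lengths of this model can be computed by hand:

    ```python
    # Walking the corrected model's shapes in plain Python, no TensorFlow needed.
    time_steps, features = 29907, 1
    units, filters, pool_size, stride = 5, 100, 203, 1

    lstm_out = (time_steps, units)            # return_sequences=True -> (29907, 5)
    conv_out = (time_steps, filters)          # padding="same" keeps length -> (29907, 100)
    pool_len = (time_steps - pool_size) // stride + 1   # "valid" pooling
    pool_out = (pool_len, filters)            # (29705, 100)
    flat = pool_len * filters                 # values fed into the Dense layer
    print(pool_out, flat)                     # (29705, 100) 2970500
    ```

    The Flatten layer therefore hands nearly three million values to the final Dense layer, which is worth keeping in mind for memory and parameter count.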