My data is a genome sequence, essentially one long string like "AAATTGCCAA...AA". There are 1100 rows in total, each of length 29907, and there are 5 target values.
I converted the data into a NumPy array of float values. The shape of my data is train_data.shape: (1100, 29907). Here is how my data looks after converting it to a NumPy array:
array([[1.  , 0.25, 0.25, ..., 0.  , 0.  , 0.  ],
       [0.75, 0.5 , 0.5 , ..., 0.  , 0.  , 0.  ],
       [1.  , 0.75, 1.  , ..., 0.  , 0.  , 0.  ],
       ...,
       [0.5 , 0.25, 0.75, ..., 0.  , 0.  , 0.  ],
       [0.5 , 0.25, 0.75, ..., 0.  , 0.  , 0.  ],
       [1.  , 0.75, 1.  , ..., 0.  , 0.  , 0.  ]], dtype=float32)
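For context, these floats come from a per-base encoding of the sequence strings. The exact mapping I used isn't shown above, so the lookup table in this sketch is an assumption; it only illustrates how a row of bases could become a fixed-length float vector, with 0.0 as padding:

import numpy as np

# Hypothetical mapping; the real one used for this data isn't shown above.
BASE_TO_FLOAT = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}

def encode(seq, length=29907):
    # Map each base to a float; unknown characters become 0.0
    vals = [BASE_TO_FLOAT.get(base, 0.0) for base in seq[:length]]
    vals += [0.0] * (length - len(vals))  # pad short sequences with 0.0
    return np.asarray(vals, dtype=np.float32)

sequences = ["AAATTGCC", "TTGACGTA"]  # stand-ins for the 1100 real rows
encoded = np.stack([encode(s) for s in sequences])
print(encoded.shape)  # (2, 29907)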
We know that an LSTM requires 3D input of shape (batch_size, time_steps, features).
So I reshaped the data:
train_data = train_data.reshape(1100, 29907, 1)
Now, when I pass the data into my model, my input shape is input_shape=(29907, 1100).
The actual model is given below. But when I run it, I get the following value error:
ValueError: Input 0 is incompatible with layer sequential: expected shape=(None, None, 1100), found shape=(1, 29907, 1)
I don't understand the error: is my input shape wrong, or is the reshaping wrong? Also, I have used a Conv1D as the second layer; does the shape have to be compatible with both layers (LSTM and Conv1D)? In the LSTM layer I set 5 outputs (units=5), as there are 5 target values.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Conv1D, MaxPooling1D, Dropout, Flatten, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import regularizers

def create_model():
    num_classes = 5
    model = Sequential([
        LSTM(units=16, input_shape=(train_data.shape[1], train_data.shape[2]),
             activation='relu', return_sequences=True,
             kernel_regularizer=regularizers.l2(0.0001)),
        Dropout(rate=0.5),
        Conv1D(filters=100, kernel_size=21, strides=1,
               padding="same", activation='relu'),
        Dropout(rate=0.3),
        MaxPooling1D(pool_size=148, strides=1, padding='valid'),
        Dropout(rate=0.1),
        # LSTM(16, activation='relu'),
        Flatten(),
        Dense(num_classes, activation='softmax')
    ])
    # compile the model
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer=Adam(learning_rate=0.0001),
                  metrics=['accuracy'])
    return model
1D convolution on sequences expects a 3D input, whereas an LSTM outputs a 2D array (batch_size, units) when the argument return_sequences is set to False. This argument controls whether the layer returns the output at every time step instead of only the final time step. Changing it to True returns a 3D array (batch_size, time_steps, units) instead of a 2D array.
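A quick way to see the difference is to feed a dummy batch through the layer with each setting; this sketch (assuming TensorFlow/Keras, as used above, and made-up small shapes) prints the output shapes:

import numpy as np
import tensorflow as tf

x = np.random.rand(2, 10, 1).astype("float32")  # (batch_size, time_steps, features)

# 3D output, one vector per time step -- what Conv1D downstream needs
print(tf.keras.layers.LSTM(16, return_sequences=True)(x).shape)   # (2, 10, 16)
# 2D output, final time step only
print(tf.keras.layers.LSTM(16, return_sequences=False)(x).shape)  # (2, 16)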
LSTM(units=5, input_shape=(29907, 1100), activation='relu', return_sequences=True, kernel_regularizer=regularizers.l2(0.0001))
Also, just a small thing, but don't hardcode the input shape; instead, set it to something like input_shape=(X_train.shape[1], X_train.shape[2]).
Edit:
In addition to that, you need to expand the last dimension of the data before passing it to the model; reshape it using this method:
import numpy as np

# train_data shape: (1100, 29907)
train_data = np.expand_dims(train_data, -1)  # new shape = (1100, 29907, 1)
Do not reshape it like this; you could easily make an error with the number of values in any of the dimensions:
train_data = train_data.reshape(1100, 29907, 1)
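As a toy illustration of that pitfall (with made-up small shapes): reshape happily accepts swapped dimension sizes and silently scrambles the rows, while expand_dims has no numbers to mistype:

import numpy as np

a = np.arange(6).reshape(2, 3)   # stand-in for (rows, seq_len) data
safe = np.expand_dims(a, -1)     # (2, 3, 1): rows stay intact
scrambled = a.reshape(3, 2, 1)   # runs without error, but rows are now mixed up
print(safe[:, :, 0])             # [[0 1 2], [3 4 5]]
print(scrambled[:, :, 0])        # [[0 1], [2 3], [4 5]]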
Then just set the LSTM input shape from the 2nd and 3rd dimensions of the input data.
num_classes = 5
model = Sequential([
    LSTM(units=num_classes, input_shape=(train_data.shape[1], train_data.shape[2]),
         activation='relu', return_sequences=True,
         kernel_regularizer=regularizers.l2(0.0001)),
    Conv1D(filters=100, kernel_size=21, strides=1,
           padding="same", activation='relu'),
    Dropout(rate=0.5),
    MaxPooling1D(pool_size=203, strides=1, padding='valid'),
    Flatten(),
    Dense(num_classes, activation='softmax')
])
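You would then compile and fit this the same way as the original model (reusing the imports from the question's code). A minimal sketch, assuming train_labels is an array of 1100 integer class IDs in 0..4, and with arbitrary epochs and batch_size:

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=Adam(learning_rate=0.0001),
              metrics=['accuracy'])
# train_labels: assumed integer labels, shape (1100,), values 0-4
model.fit(train_data, train_labels, epochs=10, batch_size=16, validation_split=0.1)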