I'm training an LSTM model for sign language recognition using MediaPipe features.
I'm having problems defining the model because the videos have different lengths: errors appear when training. I need the correct setup of the Keras layers.
In fact, all the solutions I have found cover either LSTM models with a variable timestep dimension but only one feature (for example, a word), or video input with several features but the same fixed timestep for all videos. I haven't found a single solution for variable timesteps with several features.
Dataset: 35 signs (30 alphabet signs + 5 number signs).
Each video has a different length, from videos where MediaPipe only recognizes 4 frames to others with 111 frames.
For each frame, I'm extracting 21 hand landmarks with the MediaPipe library. Each landmark has 5 features (x, y, z, visibility, and presence), which makes 21 * 5 = 105 features per frame.
Model Input:
As the input of an LSTM needs to be a NumPy array with a constant number of frames, I have filled the empty timesteps with zeros using the following code, so I can mask them later:
# length is the number of frames of the longest video (111)
X = np.array([video + [[0] * 105] * (length - len(video)) for video in X]).astype('float32')
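For reference, Keras ships a helper that does this same post-padding; a minimal sketch, assuming X is a list of per-video lists of 105-float frames:
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Pads every video at the end with all-zero frames up to the longest one;
# padding='post' matches the list comprehension above, so the zero frames
# land after the real frames.
X = pad_sequences(X, padding='post', dtype='float32', value=0.0)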
The resulting array has shape (3036, 111, 105), where 3036 is the number of videos in the dataset, 111 the timesteps/frames per video, and 105 the number of features per frame.
Each video (111 timesteps, 105 features) looks like this:
0.85280,0.84741,-0.07237,0.00000,0.00000 ... 0.000
0.83034,0.93954,-0.11003,0.00000,0.00000 ... 1.000
...
0.82979,0.99424,-0.12224,0.00000,0.00000 ... 1.000
0.00000,0.00000, 0.00000,0.00000,0.00000 ... 1.000
0.00000,0.00000, 0.00000,0.00000,0.00000 ... 0.000
...
0.00000,0.00000, 0.00000,0.00000,0.00000 ... 0.000
Model:
model = Sequential()
model.add(Masking(mask_value=0, input_shape=(None, 35)))
model.add(LSTM(64, return_sequences=True, activation='relu'))
model.add(LSTM(128, return_sequences=True, activation='relu'))
model.add(LSTM(64, return_sequences=False, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(len(name_classes.keys()), activation='softmax'))
In this case I'm getting the error ValueError: Input 0 is incompatible with layer lstm: expected shape=(None, None, 35), found shape=[None, 111, 105]
How can I correctly set up the Keras layers, especially the Masking layer?
If I remove the Masking layer I can make it work, but then my loss is always NaN and all predictions are the same value.
model = Sequential()
model.add(LSTM(64, return_sequences=True, activation='relu', input_shape=(None, 105)))
model.add(LSTM(128, return_sequences=True, activation='relu'))
model.add(LSTM(64, return_sequences=False, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(len(name_classes.keys()), activation='softmax'))
Note: some of these signs are dynamic, so I'm not interested in taking a single frame for the whole video.
I have thought about alternatives to masking, but I would prefer masking the blank frames of the NumPy array, if possible.
You're really close. You need to change the Masking layer input shape to:
model.add(Masking(mask_value=0, input_shape=(None, 105)))
It receives 105 features over a varying number of timesteps. The way the Masking layer works is that if all 105 features at a timestep are 0, that timestep is skipped. From the documentation:
For each timestep in the input tensor (dimension #1 in the tensor), if all values in the input tensor at that timestep are equal to mask_value, then the timestep will be masked (skipped) in all downstream layers (as long as they support masking).
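You can verify this behaviour directly by calling the layer and inspecting the boolean mask it attaches to its output (a minimal sketch; the toy input uses 2 features instead of 105 to keep it short):
import numpy as np
import tensorflow as tf

masking = tf.keras.layers.Masking(mask_value=0.0)
# One sample, two timesteps: the second is an all-zero padding frame.
x = np.array([[[0.8, 0.9],
               [0.0, 0.0]]], dtype='float32')
out = masking(x)
print(out._keras_mask)  # [[ True False]] -> the padding timestep is skipped downstream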
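Putting it together, here is a minimal sketch of the corrected model. Note that I've also dropped activation='relu' from the LSTM layers in favour of the default tanh; unbounded ReLU activations inside recurrent layers are a common cause of the NaN loss you described, though that swap is a suggestion of mine rather than part of the masking fix:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Masking, LSTM, Dense

model = Sequential()
# None leaves the timestep dimension variable; 105 features per frame.
model.add(Masking(mask_value=0.0, input_shape=(None, 105)))
# Default tanh activation; relu in LSTMs tends to blow up to NaN.
model.add(LSTM(64, return_sequences=True))
model.add(LSTM(128, return_sequences=True))
model.add(LSTM(64, return_sequences=False))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(len(name_classes.keys()), activation='softmax'))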