I'm training an LSTM model for sign language recognition using MediaPipe features.
I'm having problems defining the model because the videos have different lengths: errors appear when training. I need the correct setup of the Keras layers.
In fact, all the solutions I have found cover either LSTM models with a variable timestep dimension but only one feature (for example, a word), or video input with several features but the same fixed timestep for all videos. I haven't found a single solution for variable timesteps with several features.
Dataset: 35 signs (30 alphabet signs + 5 number signs).
Each video has a different length, from videos where MediaPipe only recognizes 4 frames to others with 111 frames.
For each frame, I'm extracting 21 hand landmarks with the MediaPipe library. Each landmark has 5 features (x, y, z, visibility, and presence), which makes 21 * 5 = 105 features per frame.
Model Input:
As the input of an LSTM needs to be a NumPy array with a constant number of frames, I have filled the empty timesteps with zeros using the following code, so I can mask them later:
# length is the number of frames of the longest video (111)
X = np.array([video + [[0] * 105] * (length - len(video)) for video in X]).astype('float32')
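For reference, Keras ships a helper that does this same post-padding; a minimal sketch, assuming X is a list of per-video lists of 105-float frames:
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Pads every video at the end with all-zero frames up to the longest one;
# padding='post' matches the list comprehension above, so the zero frames
# land after the real frames.
X = pad_sequences(X, padding='post', dtype='float32', value=0.0)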
The resulting array has shape (3036, 111, 105), where 3036 is the number of videos in the dataset, 111 the timesteps/frames per video, and 105 the number of features per frame.
Each video (111 timesteps, 105 features) looks like this:
0.85280,0.84741,-0.07237,0.00000,0.00000 ... 0.000
0.83034,0.93954,-0.11003,0.00000,0.00000 ... 1.000
...
0.82979,0.99424,-0.12224,0.00000,0.00000 ... 1.000
0.00000,0.00000, 0.00000,0.00000,0.00000 ... 1.000
0.00000,0.00000, 0.00000,0.00000,0.00000 ... 0.000
...
0.00000,0.00000, 0.00000,0.00000,0.00000 ... 0.000
Model:
model = Sequential()
model.add(Masking(mask_value=0, input_shape=(None, 35)))
model.add(LSTM(64, return_sequences=True, activation='relu'))
model.add(LSTM(128, return_sequences=True, activation='relu'))
model.add(LSTM(64, return_sequences=False, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(len(name_classes.keys()), activation='softmax'))
In this case I'm getting the error ValueError: Input 0 is incompatible with layer lstm: expected shape=(None, None, 35), found shape=[None, 111, 105]
How can I correctly set up the Keras layers, especially the Masking layer?
If I remove the Masking layer I can make it work, but then my loss is always NaN and all predictions are the same value.
model = Sequential()
model.add(LSTM(64, return_sequences=True, activation='relu', input_shape=(None, 105)))
model.add(LSTM(128, return_sequences=True, activation='relu'))
model.add(LSTM(64, return_sequences=False, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(len(name_classes.keys()), activation='softmax'))
Note: some of these signs are dynamic, so I'm not interested in taking a single frame for the whole video.
I have thought about alternatives to masking, but I would prefer masking the blank frames of the NumPy array, if possible.
You're really close. You need to change the Masking layer input shape to:
model.add(Masking(mask_value=0, input_shape=(None, 105)))
It receives 105 features over a varying number of timesteps. The way the Masking layer works is that if all 105 features at a timestep are 0, that timestep is skipped. From the documentation:
For each timestep in the input tensor (dimension #1 in the tensor), if all values in the input tensor at that timestep are equal to mask_value, then the timestep will be masked (skipped) in all downstream layers (as long as they support masking).
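You can verify this behaviour directly by calling the layer and inspecting the boolean mask it attaches to its output (a minimal sketch; the toy input uses 2 features instead of 105 to keep it short):
import numpy as np
import tensorflow as tf

masking = tf.keras.layers.Masking(mask_value=0.0)
# One sample, two timesteps: the second is an all-zero padding frame.
x = np.array([[[0.8, 0.9],
               [0.0, 0.0]]], dtype='float32')
out = masking(x)
print(out._keras_mask)  # [[ True False]] -> the padding timestep is skipped downstream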
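Putting it together, here is a minimal sketch of the corrected model. Note that I've also dropped activation='relu' from the LSTM layers in favour of the default tanh; unbounded ReLU activations inside recurrent layers are a common cause of the NaN loss you described, though that swap is a suggestion of mine rather than part of the masking fix:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Masking, LSTM, Dense

model = Sequential()
# None leaves the timestep dimension variable; 105 features per frame.
model.add(Masking(mask_value=0.0, input_shape=(None, 105)))
# Default tanh activation; relu in LSTMs tends to blow up to NaN.
model.add(LSTM(64, return_sequences=True))
model.add(LSTM(128, return_sequences=True))
model.add(LSTM(64, return_sequences=False))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(len(name_classes.keys()), activation='softmax'))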