I am currently working through the following paper: https://arxiv.org/pdf/1703.08581.pdf. I am struggling to understand the part about how a convolution is performed on an input of log mel filterbank features:
We train seq2seq models for both end-to-end speech translation, and a baseline model for speech recognition. We found that the same architecture, a variation of that from [10], works well for both tasks. We use 80 channel log mel filterbank features extracted from 25ms windows with a hop size of 10ms, stacked with delta and delta-delta features. The output softmax of all models predicts one of 90 symbols, described in detail in Section 4, that includes English and Spanish lowercase letters. The encoder is composed of a total of 8 layers. The input features are organized as a T × 80 × 3 tensor, i.e. raw features, deltas, and delta-deltas are concatenated along the 'depth' dimension. This is passed into a stack of two convolutional layers with ReLU activations, each consisting of 32 kernels with shape 3 × 3 × depth in time × frequency. These are both strided by 2 × 2, downsampling the sequence in time by a total factor of 4, decreasing the computation performed in the following layers. Batch normalization [26] is applied after each layer.
As I understand it, the input to the convolutional layer is 3-dimensional: the number of 25 ms windows (T) x 80 (features for each window) x 3 (features, delta features and delta-delta features). However, the kernels used on those inputs seem to have 4 dimensions, and I do not understand why that is. Wouldn't a 4-dimensional kernel need a 4-dimensional input? In my head, the input has the same dimensions as an RGB picture: width (time) x height (frequency) x color channels (features, delta features and delta-delta features). Therefore I would think of a kernel for a 2D convolution as a filter of size a (filter width) x b (filter height) x 3 (depth of the input). Am I missing something here? What is wrong with my idea, or what is done differently in this paper?
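To make my picture analogy concrete, this is roughly how I would set it up myself (just a sketch from my side in PyTorch, not the authors' code; the 3x3 kernel, 32 filters, and 2x2 stride are taken from the quote above, while the padding and the number of frames are my own choices):

```python
import torch
import torch.nn as nn

# The way I picture the input: like an RGB image, i.e.
# (batch, channels=3, time, frequency).
x = torch.randn(1, 3, 1000, 80)  # 1000 frames is just an example value

# A 2D convolution whose filters span 3 (time) x 3 (frequency) x 3 (input depth).
conv = nn.Conv2d(in_channels=3, out_channels=32,
                 kernel_size=(3, 3), stride=(2, 2), padding=1)

print(conv(x).shape)  # torch.Size([1, 32, 500, 40])
```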
Thanks in advance for your answer!
I figured it out; it turns out it was just a misunderstanding on my side: the authors are using 32 kernels of shape 3x3, each of which spans the full input depth (as in any standard 2D convolution), so a single kernel has the 3 × 3 × depth shape from the quote, and 32 is simply the number of kernels. After two layers with 2x2 striding this results in an output of shape T/4 x 20 x 32, where T stands for the time dimension.
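For anyone who stumbles over the same thing, here is a minimal sketch of the full two-layer front-end in PyTorch (my own reconstruction, not the authors' code; I am assuming "same"-style padding so the shapes come out to exactly T/4 and 20, and the BN/ReLU ordering is my guess):

```python
import torch
import torch.nn as nn

# Two conv layers as described in the quote: 32 kernels of 3x3, stride 2x2,
# with batch normalization and ReLU after each layer.
frontend = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
)

T = 1000                      # example number of 25 ms frames
x = torch.randn(1, 3, T, 80)  # (batch, depth, time, frequency)
print(frontend(x).shape)      # torch.Size([1, 32, 250, 20]), i.e. T/4 x 20 x 32
```

Note that the first layer's weight tensor has shape (32, 3, 3, 3) and the second layer's has shape (32, 32, 3, 3): the per-kernel shape is 3 x 3 x depth, and the extra dimension is just the stack of 32 kernels.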