
Speech recognition with CNNs and Librosa: Can I combine MFCC and audio data?


I'm building CNNs for speech recognition with Librosa. I've extracted MFCCs for each audio file and preprocessed my raw audio data. The audio data has shape (93894, 8000) and the MFCCs have shape (93894, 26, 16). As they are, I can't feed them into the same model because of the mismatch in dimensionality. I could build separate models, some 1D networks taking the raw audio and some 2D networks taking the MFCCs, and see which performs best. But I was hoping to feed both into the same model. Is there a way to do that? Does flattening the MFCCs make any sense?


Solution

  • A single-input network can't accept tensors of different dimensionality, so without a multi-branch (ensemble-style) architecture you can't feed both into the same model. I created separate networks to process the MFCCs and the raw audio, and for what it's worth, the models operating on just MFCCs trained faster and were more accurate.
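If you do want a single model, the usual workaround is a two-branch network: a `Conv1D` branch for the raw audio, a `Conv2D` branch for the MFCCs, and a concatenation of the pooled features before the classifier head. A hedged Keras sketch, where `n_classes = 10` is an assumption (the question never states the label count) and the layer sizes are placeholders:

```python
import numpy as np
from tensorflow.keras import layers, Model

n_classes = 10  # assumption; substitute your actual number of labels

# Branch 1: raw audio, shape (8000,) plus a channel axis.
audio_in = layers.Input(shape=(8000, 1), name="raw_audio")
a = layers.Conv1D(16, 9, strides=4, activation="relu")(audio_in)
a = layers.GlobalAveragePooling1D()(a)

# Branch 2: MFCCs, shape (26, 16) plus a channel axis.
mfcc_in = layers.Input(shape=(26, 16, 1), name="mfcc")
m = layers.Conv2D(16, (3, 3), activation="relu")(mfcc_in)
m = layers.GlobalAveragePooling2D()(m)

# Merge the pooled features and classify.
merged = layers.concatenate([a, m])
out = layers.Dense(n_classes, activation="softmax")(merged)
model = Model(inputs=[audio_in, mfcc_in], outputs=out)

# Dummy batch of 4 clips to confirm the shapes line up.
batch_audio = np.zeros((4, 8000, 1), dtype=np.float32)
batch_mfcc = np.zeros((4, 26, 16, 1), dtype=np.float32)
preds = model([batch_audio, batch_mfcc])
print(preds.shape)  # (4, 10)
```

Each branch sees only the input whose shape it was built for, so nothing has to be flattened; you'd call `model.fit([audio_array, mfcc_array[..., None]], labels)` with both arrays. Whether the combined model actually beats the MFCC-only one is an empirical question, and in my experiments it did not.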