I'm studying end-to-end architectures for automatic speech recognition (ASR) systems.
The RNN transducer (RNN-T) is one of the popular end-to-end methods, but it is difficult to train.
I'm therefore looking for a framework or toolkit that makes it easy to implement a baseline model and then modify it as needed.
Thanks in advance!
For those interested, I'm currently using the ESPnet toolkit, which mainly focuses on end-to-end speech recognition and end-to-end text-to-speech.