I have an audio file and a text that corresponds to the speech in this audio file.
Is there any way to match the text to the audio so that I get timestamps showing where each word of the text occurs in the audio?
So I have found exactly what I was looking for.
Apparently, the technique that matches a given text to an audio recording and returns the timestamps is called forced alignment.
Here is a very useful list of forced alignment tools: https://github.com/pettarin/forced-alignment-tools
Personally, I used Aeneas, and it worked really well for me.
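
In case it helps anyone, here is a rough sketch of how this can look with the aeneas Python API. It is only an outline under some assumptions: the file paths are placeholders, the language code "eng" is just an example, and the helper names align and read_sync_map are my own, not part of aeneas. With is_text_type=plain, aeneas treats each line of the text file as one fragment, so to get per-word timestamps the simplest approach I know of is to put one word per line in the text file.

```python
import json
import os


def align(audio_path, text_path, sync_map_path, language="eng"):
    """Run aeneas on one audio/text pair and write a JSON sync map.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    # Imported lazily so read_sync_map() below works without aeneas installed.
    from aeneas.executetask import ExecuteTask
    from aeneas.task import Task

    # Plain text input (one fragment per line), JSON output.
    config = "task_language={}|is_text_type=plain|os_task_file_format=json".format(language)
    task = Task(config_string=config)
    task.audio_file_path_absolute = os.path.abspath(audio_path)
    task.text_file_path_absolute = os.path.abspath(text_path)
    task.sync_map_file_path_absolute = os.path.abspath(sync_map_path)

    ExecuteTask(task).execute()   # compute the alignment
    task.output_sync_map_file()   # write the result to sync_map_path


def read_sync_map(json_text):
    """Turn an aeneas JSON sync map into (begin_sec, end_sec, text) tuples."""
    data = json.loads(json_text)
    return [
        (float(frag["begin"]), float(frag["end"]), " ".join(frag["lines"]))
        for frag in data["fragments"]
    ]


# Example usage (placeholder file names):
# align("speech.mp3", "speech.txt", "syncmap.json")
# with open("syncmap.json", encoding="utf-8") as f:
#     for begin, end, text in read_sync_map(f.read()):
#         print("{:8.3f} {:8.3f}  {}".format(begin, end, text))
```

The alignment and the parsing are kept separate so the timestamps can be re-read later without running the alignment again. Note that aeneas needs ffmpeg and espeak installed on the system, so check the install notes for your platform first.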