I am developing an application that uses voice recognition to help blind people learn music. For this, I am thinking about using something like DialogFlow, or even Amazon Alexa, in order not to reinvent the wheel. However, there are times when I want to access the raw audio data, in order to check whether instruments are in tune. With these technologies, by default, all audio input is interpreted and, consequently, converted into text. So, is there a way to use the raw audio data instead of having the user's speech interpreted?
For a number of reasons (mainly privacy and security), Amazon Alexa and other similar technologies will not give you access to the user's raw audio input. Capturing the sound of an instrument through Alexa is therefore not a viable way to implement a tuner. Instead, you should capture the audio yourself for the pitch analysis, and use Alexa/DialogFlow in conjunction with it only for command interpretation; a sketch of that approach is below.
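To illustrate the capture-and-analyse side, here is a minimal sketch in Python, assuming the third-party `sounddevice` and `numpy` packages (both are my choices, not anything prescribed by Alexa or DialogFlow; any audio I/O library would work). It records a short clip from the default microphone and picks the strongest FFT bin as a rough pitch estimate:

```python
import numpy as np
import sounddevice as sd  # assumed third-party audio I/O library

SAMPLE_RATE = 44100  # samples per second
DURATION = 1.0       # seconds of audio to analyse per estimate

def estimate_pitch() -> float:
    """Record a short mono clip and return its dominant frequency in Hz."""
    frames = int(SAMPLE_RATE * DURATION)
    recording = sd.rec(frames, samplerate=SAMPLE_RATE, channels=1, dtype="float64")
    sd.wait()  # block until the recording is complete
    samples = recording.flatten()

    # Magnitude spectrum of the clip
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / SAMPLE_RATE)

    # Ignore DC/sub-audio bins, then take the strongest remaining bin
    audible = freqs > 20.0
    return freqs[audible][np.argmax(spectrum[audible])]

if __name__ == "__main__":
    pitch = estimate_pitch()
    print(f"Dominant frequency: {pitch:.1f} Hz")  # concert A should read ~440 Hz
```

Note that picking the single strongest FFT bin is a deliberately naive estimate; a production tuner would more likely use autocorrelation or an algorithm such as YIN, which cope much better with harmonics and noise.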