c# .net azure azure-cognitive-services

Azure Speech SDK Speech-to-Text to Stream Audio Segments


I have been working with Azure's Speech-to-Text service, using the recognize-from-in-memory-stream method. I want to stream only certain segments of the audio to the service, but I am not sure how to do so. Say I have a 5-minute video and want to stream only the first 30 seconds, or only the portion from the 1-minute mark to the 3-minute mark of the audio file: what would I need to enable or change in the following code?

I tried using CreatePullStream() instead of CreatePushStream() and providing the mark in seconds, but that did not achieve the goal described above. If anyone knows how to do this, please let me know. Thanks!

using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

class Program 
{
    async static Task FromStream(SpeechConfig speechConfig)
    {
        using var reader = new BinaryReader(File.OpenRead("audioFile.wav"));
        using var audioInputStream = AudioInputStream.CreatePushStream();
        using var audioConfig = AudioConfig.FromStreamInput(audioInputStream);
        using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

        byte[] readBytes;
        do
        {
            readBytes = reader.ReadBytes(1024);
            audioInputStream.Write(readBytes, readBytes.Length);
        } while (readBytes.Length > 0);

        var result = await recognizer.RecognizeOnceAsync();
        Console.WriteLine($"RECOGNIZED: Text={result.Text}");
    }

    async static Task Main(string[] args)
    {
        var speechConfig = SpeechConfig.FromSubscription("<paste-your-subscription-key>", "<paste-your-region>");
        await FromStream(speechConfig);
    }
}

Solution

  • You can use NAudio.Wave to cut the source .wav file. For instance, to recognize the 1 min to 3 min portion of a .wav file, try the code below:

    using System;
    using System.IO;
    using System.Threading.Tasks;
    using Microsoft.CognitiveServices.Speech;
    using Microsoft.CognitiveServices.Speech.Audio;
    using NAudio.Wave;
    
    public class Program
    {
        public async static Task FromStream(SpeechConfig speechConfig)
        {
            var inputAudioPath = @"<path>";
            var outputAudioPath = @"<path>";
            var startAt = new TimeSpan(0, 1, 0);  // start at the 1-minute mark
            var duration = new TimeSpan(0, 2, 0); // 1:00 to 3:00 is a 2-minute segment
    
            CutAudio(inputAudioPath, outputAudioPath, startAt, duration);
    
            // Feed the trimmed file to the Speech SDK through a push stream
            using var reader = new BinaryReader(File.OpenRead(outputAudioPath));
            using var audioInputStream = AudioInputStream.CreatePushStream();
            using var audioConfig = AudioConfig.FromStreamInput(audioInputStream);
            using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);
    
            byte[] readBytes;
            do
            {
                readBytes = reader.ReadBytes(1024);
                audioInputStream.Write(readBytes, readBytes.Length);
            } while (readBytes.Length > 0);
            audioInputStream.Close(); // signal end of audio to the recognizer
    
            var result = await recognizer.RecognizeOnceAsync();
            Console.WriteLine($"RECOGNIZED: Text={result.Text}");
        }
    
    
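        // Writes the [startAt, startAt + duration) portion of inputPath to destPath as 16-bit PCM.
        // The sample rate and channel count of the source file are kept as-is.
        public static void CutAudio(string inputPath, string destPath, TimeSpan startAt, TimeSpan duration)
        {
            using (var reader = new AudioFileReader(inputPath))
            {
                reader.CurrentTime = startAt; // jump forward to the position we want to start from
                WaveFileWriter.CreateWaveFile16(destPath, reader.Take(duration));
            }
        }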
    
        public async static Task Main(string[] args)
        {
            var speechConfig = SpeechConfig.FromSubscription("<key>", "<region>");
            await FromStream(speechConfig);
        }
    }
    

    Result: [two screenshots in the original post: Result 1, Result 2]

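As an alternative to writing a trimmed file to disk, the time range can be mapped to byte offsets so that only those bytes are written to the push stream. The sketch below is a minimal illustration and not part of the original answer: `WavSegment`, `ByteRange`, and `ReadSegment` are hypothetical names, and it assumes uncompressed PCM with a canonical 44-byte header (real WAV files can carry extra chunks, in which case the header would need to be parsed instead).

```csharp
using System;
using System.IO;

public static class WavSegment
{
    // Assumption: canonical PCM WAV header with no extra chunks.
    public const int HeaderSize = 44;

    // Converts a time range into a byte offset and length within the PCM payload.
    public static (long Offset, long Length) ByteRange(
        int sampleRate, short channels, short bitsPerSample,
        TimeSpan startAt, TimeSpan duration)
    {
        int blockAlign = channels * bitsPerSample / 8;      // bytes per sample frame
        long bytesPerSecond = (long)sampleRate * blockAlign;

        long offset = (long)(startAt.TotalSeconds * bytesPerSecond);
        long length = (long)(duration.TotalSeconds * bytesPerSecond);

        // Round down to whole sample frames so channels stay aligned.
        offset -= offset % blockAlign;
        length -= length % blockAlign;
        return (offset, length);
    }

    // Reads the PCM bytes for the requested range from a seekable WAV stream.
    // Returns fewer bytes (possibly none) if the range runs past the end.
    public static byte[] ReadSegment(
        Stream wav, int sampleRate, short channels, short bitsPerSample,
        TimeSpan startAt, TimeSpan duration)
    {
        var (offset, length) = ByteRange(sampleRate, channels, bitsPerSample, startAt, duration);

        wav.Seek(HeaderSize + offset, SeekOrigin.Begin);
        long available = Math.Max(0, wav.Length - wav.Position);
        var buffer = new byte[checked((int)Math.Min(length, available))];

        int read = 0;
        while (read < buffer.Length)
        {
            int n = wav.Read(buffer, read, buffer.Length - read);
            if (n == 0) break; // unexpected end of stream
            read += n;
        }
        Array.Resize(ref buffer, read);
        return buffer;
    }
}
```

The returned bytes could then be passed to `audioInputStream.Write(segment, segment.Length)` in place of the file-reading loop. The format values must match both the file and the stream; a push stream created without arguments expects 16 kHz, 16-bit, mono PCM.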
    By the way, if you want to recognize long audio, please see the official documentation and my previous post.
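For a segment longer than a single utterance, continuous recognition is the usual route, since `RecognizeOnceAsync` returns after the first utterance. The following is a minimal sketch rather than the approach from the linked post: `LongAudio` and `RecognizeAllAsync` are hypothetical names, and the input is assumed to already be 16 kHz, 16-bit, mono PCM (the default push stream format).

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

public static class LongAudio
{
    // Hypothetical helper: transcribes every utterance in wavPath, not just the first.
    public static async Task RecognizeAllAsync(SpeechConfig speechConfig, string wavPath)
    {
        using var audioInputStream = AudioInputStream.CreatePushStream();
        using var audioConfig = AudioConfig.FromStreamInput(audioInputStream);
        using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

        var finished = new TaskCompletionSource<bool>();

        // Fires once per recognized utterance.
        recognizer.Recognized += (s, e) =>
        {
            if (e.Result.Reason == ResultReason.RecognizedSpeech)
                Console.WriteLine($"RECOGNIZED: Text={e.Result.Text}");
        };
        // Canceled is raised on errors and also when the stream reaches its end.
        recognizer.Canceled += (s, e) => finished.TrySetResult(false);
        recognizer.SessionStopped += (s, e) => finished.TrySetResult(true);

        await recognizer.StartContinuousRecognitionAsync();

        // Push the audio; the WAV header is pushed too, as in the code above.
        using (var file = File.OpenRead(wavPath))
        {
            var buffer = new byte[4096];
            int read;
            while ((read = file.Read(buffer, 0, buffer.Length)) > 0)
                audioInputStream.Write(buffer, read);
        }
        audioInputStream.Close(); // tells the recognizer no more audio is coming

        await finished.Task;
        await recognizer.StopContinuousRecognitionAsync();
    }
}
```

It would be called the same way as `FromStream`, for example `await LongAudio.RecognizeAllAsync(speechConfig, outputAudioPath);` after `CutAudio` has produced the trimmed file.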