I have been working with Azure's Speech-to-Text service (found here), using the "recognize from in-memory stream" method. What I want to do is stream only certain segments of the audio to the service, but I am not sure how to do so. Say I have a 5-minute video and I only want to stream the first 30 seconds, or only the part from the 1-minute mark to the 3-minute mark of the audio file. What would I need to enable or change in the following code to do that?
I have attempted to use CreatePullStream() instead of CreatePushStream(), passing the mark in seconds, but it did not produce the result described above. If anyone knows how I can achieve this, please let me know. Many thanks!
using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

class Program
{
    async static Task FromStream(SpeechConfig speechConfig)
    {
        var reader = new BinaryReader(File.OpenRead("audioFile.wav"));
        using var audioInputStream = AudioInputStream.CreatePushStream();
        using var audioConfig = AudioConfig.FromStreamInput(audioInputStream);
        using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

        byte[] readBytes;
        do
        {
            readBytes = reader.ReadBytes(1024);
            audioInputStream.Write(readBytes, readBytes.Length);
        } while (readBytes.Length > 0);

        var result = await recognizer.RecognizeOnceAsync();
        Console.WriteLine($"RECOGNIZED: Text={result.Text}");
    }

    async static Task Main(string[] args)
    {
        var speechConfig = SpeechConfig.FromSubscription("<paste-your-subscription-key>", "<paste-your-region>");
        await FromStream(speechConfig);
    }
}
You can use NAudio.Wave to cut your source .wav file. For instance, if you want to recognize the 1 min to 3 min portion of a .wav file, try the code below:
using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using NAudio.Wave;

public class Program
{
    public static async Task FromStream(SpeechConfig speechConfig)
    {
        var inputAudioPath = @"<path>";
        var outputAudioPath = @"<path>";
        var startAt = new TimeSpan(0, 1, 0);  // start at the 1 min mark
        var duration = new TimeSpan(0, 2, 0); // keep 2 min, i.e. 1:00 to 3:00

        // Write the selected segment to a new .wav file, then stream that file.
        CutAudio(inputAudioPath, outputAudioPath, startAt, duration);

        using var reader = new BinaryReader(File.OpenRead(outputAudioPath));
        using var audioInputStream = AudioInputStream.CreatePushStream();
        using var audioConfig = AudioConfig.FromStreamInput(audioInputStream);
        using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

        byte[] readBytes;
        do
        {
            readBytes = reader.ReadBytes(1024);
            audioInputStream.Write(readBytes, readBytes.Length);
        } while (readBytes.Length > 0);
        audioInputStream.Close(); // signal that no more audio will be written

        // Note: RecognizeOnceAsync returns after the first recognized utterance.
        var result = await recognizer.RecognizeOnceAsync();
        Console.WriteLine($"RECOGNIZED: Text={result.Text}");
    }

    public static void CutAudio(string inputPath, string destPath, TimeSpan startAt, TimeSpan duration)
    {
        using (var reader = new AudioFileReader(inputPath))
        {
            reader.CurrentTime = startAt; // jump forward to the position we want to start from
            // Take(duration) limits the output to the requested length; the result is written as 16-bit PCM.
            WaveFileWriter.CreateWaveFile16(destPath, reader.Take(duration));
        }
    }

    public static async Task Main(string[] args)
    {
        var speechConfig = SpeechConfig.FromSubscription("<key>", "<region>");
        await FromStream(speechConfig);
    }
}
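If you would rather not write a temporary file, the same idea can probably be done in memory: seek the reader to the start mark and push only the bytes that cover the segment. The following is a minimal sketch, not a tested implementation. It assumes the source is an uncompressed PCM .wav file in a format the service accepts; SegmentRecognizer, SegmentBytes and RecognizeSegmentAsync are names made up for this example.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using NAudio.Wave;

public static class SegmentRecognizer
{
    // Number of bytes covering `duration`, rounded down to a whole sample frame.
    public static long SegmentBytes(int averageBytesPerSecond, int blockAlign, TimeSpan duration)
    {
        long bytes = (long)(duration.TotalSeconds * averageBytesPerSecond);
        return bytes - (bytes % blockAlign);
    }

    // Hypothetical helper: pushes only [startAt, startAt + duration) to the recognizer.
    public static async Task RecognizeSegmentAsync(
        SpeechConfig speechConfig, string wavPath, TimeSpan startAt, TimeSpan duration)
    {
        using var reader = new WaveFileReader(wavPath);
        var wf = reader.WaveFormat;

        // Assumption: the file is uncompressed PCM, so describe its format to the SDK explicitly.
        var format = AudioStreamFormat.GetWaveFormatPCM(
            (uint)wf.SampleRate, (byte)wf.BitsPerSample, (byte)wf.Channels);
        using var pushStream = AudioInputStream.CreatePushStream(format);
        using var audioConfig = AudioConfig.FromStreamInput(pushStream);
        using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

        reader.CurrentTime = startAt; // seek to the start of the segment
        long remaining = SegmentBytes(wf.AverageBytesPerSecond, wf.BlockAlign, duration);
        var buffer = new byte[wf.BlockAlign * 1024]; // always read whole sample frames

        while (remaining > 0)
        {
            int read = reader.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
            if (read == 0) break; // end of file reached before the segment ended
            pushStream.Write(buffer, read);
            remaining -= read;
        }
        pushStream.Close(); // signal that no more audio will be written

        var result = await recognizer.RecognizeOnceAsync();
        Console.WriteLine($"RECOGNIZED: Text={result.Text}");
    }
}
```

For the 1 min to 3 min case you would call `await SegmentRecognizer.RecognizeSegmentAsync(speechConfig, @"<path>", new TimeSpan(0, 1, 0), new TimeSpan(0, 2, 0));`. Because WaveFileReader skips the header for you, only raw sample data reaches the push stream.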
By the way, if you want to recognize long audio, please see this official doc and my previous post here.
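This matters for a 2-minute cut, since RecognizeOnceAsync stops after the first utterance. A rough sketch of continuous recognition over the trimmed file written by CutAudio could look like this. ContinuousExample and RecognizeAllAsync are hypothetical names, and error handling is kept to a minimum:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

public static class ContinuousExample
{
    // Hypothetical helper: prints every recognized utterance in the file at wavPath.
    public static async Task RecognizeAllAsync(SpeechConfig speechConfig, string wavPath)
    {
        using var audioConfig = AudioConfig.FromWavFileInput(wavPath);
        using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);
        var stopped = new TaskCompletionSource<int>();

        // Raised once per finished utterance.
        recognizer.Recognized += (s, e) =>
        {
            if (e.Result.Reason == ResultReason.RecognizedSpeech)
                Console.WriteLine($"RECOGNIZED: Text={e.Result.Text}");
        };

        // Raised on errors and when the end of the audio is reached.
        recognizer.Canceled += (s, e) =>
        {
            Console.WriteLine($"CANCELED: Reason={e.Reason}");
            stopped.TrySetResult(0);
        };
        recognizer.SessionStopped += (s, e) => stopped.TrySetResult(0);

        await recognizer.StartContinuousRecognitionAsync();
        await stopped.Task; // wait until the whole file has been processed
        await recognizer.StopContinuousRecognitionAsync();
    }
}
```

You would call it as `await ContinuousExample.RecognizeAllAsync(speechConfig, outputAudioPath);` after CutAudio has produced the trimmed file.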