Tags: c#, speech-recognition, text-to-speech, azure-cognitive-services, speech-to-text

Azure Speech to Text and TTS is talking to itself


Hopefully this is an "ohh, that was simpler than I made it" situation... but I can't seem to run duplex Text to Speech and Speech to Text with Azure in C# without the 'talking' being picked up by the 'listening', which creates a bit of an infinite loop...

Question: Is there a way to filter out the application's own voice so it doesn't hear itself, but can still hear when a user interrupts it and process that incoming audio?

I realize a headphone set may fix this, but I kinda need it on open speakers..

Any help or direction really appreciated! Thanks!

So far I have a pretty standard function that listens to audio via the mic and streams the recognized text to an event:

        public async Task Listen()
        {
            var stopRecognition = new TaskCompletionSource<int>(TaskCreationOptions.RunContinuationsAsynchronously);
            using var audioProcessingOptions = AudioProcessingOptions.Create(AudioProcessingConstants.AUDIO_INPUT_PROCESSING_ENABLE_DEFAULT);
            using var audioInput = AudioConfig.FromDefaultMicrophoneInput(audioProcessingOptions);

            using var recognizer = new SpeechRecognizer(Config, audioInput);
            recognizer.Recognized += Recognizer_Recognized;

            await recognizer.StartContinuousRecognitionAsync().ConfigureAwait(false);

            // Waits for completion.
            // Use Task.WaitAny to keep the task rooted.
            Task.WaitAny(new[] { stopRecognition.Task });

            // Stops recognition.
            await recognizer.StopContinuousRecognitionAsync().ConfigureAwait(false);
        }

        private void Recognizer_Recognized(object? sender, SpeechRecognitionEventArgs e)
        {
            // push the decoded audio in text to an event to display on screen...
        }
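
For reference, here is a minimal sketch of what that handler might do once it pushes text somewhere the UI can pick it up; the `TextRecognized` event is an assumption for illustration, not something the Speech SDK provides:

        // Hypothetical event the rest of the app (e.g. the UI) can subscribe to.
        public event EventHandler<string>? TextRecognized;

        private void Recognizer_Recognized(object? sender, SpeechRecognitionEventArgs e)
        {
            // Only forward final, non-empty recognition results.
            if (e.Result.Reason == ResultReason.RecognizedSpeech && !string.IsNullOrWhiteSpace(e.Result.Text))
            {
                TextRecognized?.Invoke(this, e.Result.Text);
            }
        }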

Then, when the application wants to say a few words, it calls the below

THE ISSUE: I could stop the listening here while it talks, but my application tends to talk a lot! So I'd like it to be interruptible from time to time so it can get a move on with things... but if I listen for audio, it hears itself! Argh! (One rough, flag-based idea is sketched after the Talk method below.)

        public async Task Talk(string text)
        {
            // To support Chinese Characters on Windows platform
            if (Environment.OSVersion.Platform == PlatformID.Win32NT)
            {
                Console.InputEncoding = System.Text.Encoding.Unicode;
                Console.OutputEncoding = System.Text.Encoding.Unicode;
            }

            // Set the voice name, refer to https://aka.ms/speech/voices/neural for full list.
            // https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=tts
            Config.SpeechSynthesisVoiceName = "en-AU-CarlyNeural";

            // Creates a speech synthesizer using the default speaker as audio output.
            using var synthesizer = new SpeechSynthesizer(Config);
            using var result = await synthesizer.SpeakTextAsync(text);

            if (result.Reason == ResultReason.SynthesizingAudioCompleted)
            {
                // hmmm
            }
            else if (result.Reason == ResultReason.Canceled)
            {
                var cancellation = SpeechSynthesisCancellationDetails.FromResult(result);
            }
        }
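
One rough direction for keeping the app listening while it talks, sketched below (an illustration only, not something the Azure Speech SDK does for you): leave the recognizer running, remember what the application is currently saying, and have the `Recognized` handler drop results that look like the app's own words while treating anything else as a possible user interruption. The `currentlySpeaking` field and `OnUserInterrupted` event are made-up names, and with open speakers this text-matching heuristic is fragile; reliable barge-in generally needs acoustic echo cancellation in the audio path.

        // What the app is saying right now, if anything (hypothetical field).
        private volatile string? currentlySpeaking;

        // Raised when recognized speech does not look like the app's own output (hypothetical event).
        public event EventHandler<string>? OnUserInterrupted;

        public async Task Talk(string text)
        {
            currentlySpeaking = text;                  // mark our own speech before playback starts
            try
            {
                Config.SpeechSynthesisVoiceName = "en-AU-CarlyNeural";
                using var synthesizer = new SpeechSynthesizer(Config);
                using var result = await synthesizer.SpeakTextAsync(text);
            }
            finally
            {
                currentlySpeaking = null;              // back to normal listening
            }
        }

        private void Recognizer_Recognized(object? sender, SpeechRecognitionEventArgs e)
        {
            var heard = e.Result.Text;
            if (string.IsNullOrWhiteSpace(heard)) return;

            var speaking = currentlySpeaking;
            if (speaking != null &&
                speaking.Contains(heard.TrimEnd('.', ',', '!', '?'), StringComparison.OrdinalIgnoreCase))
            {
                return;                                // most likely the app hearing itself
            }

            OnUserInterrupted?.Invoke(this, heard);    // someone (or something) else is talking
        }

The matching here is deliberately naive; partial recognitions of the synthesized audio that don't line up with the source text will still slip through, so treat it as a starting point rather than a fix.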

Solution

  • The issue you are facing is an audio feedback loop: while the application speaks, the microphone picks up its own output and feeds it back into recognition. To break the loop, you can stop listening while the application is talking (a simple form of audio suppression, sometimes loosely called "ducking"): pause continuous recognition just before synthesis starts and resume it once the utterance has finished playing, which is what the code below does.

    I made some changes to your code and got recognized text output from spoken input.

    Code:

    using System;
    using System.Threading.Tasks;
    using Microsoft.CognitiveServices.Speech;
    using Microsoft.CognitiveServices.Speech.Audio;
    
    public class SpeechService
    {
        private readonly SpeechConfig Config;
        private SpeechRecognizer recognizer;
        private AudioConfig audioInput;   // kept alive while continuous recognition is running
    
        public SpeechService(string subscriptionKey, string serviceRegion)
        {
            Config = SpeechConfig.FromSubscription(subscriptionKey, serviceRegion);
            recognizer = null;
        }
    
        public async Task Listen()
        {
            // The audio objects must outlive this method, so they are not disposed here;
            // Close() cleans everything up when the app shuts down.
            var audioProcessingOptions = AudioProcessingOptions.Create(AudioProcessingConstants.AUDIO_INPUT_PROCESSING_ENABLE_DEFAULT);
            audioInput = AudioConfig.FromDefaultMicrophoneInput(audioProcessingOptions);

            recognizer = new SpeechRecognizer(Config, audioInput);
            recognizer.Recognized += Recognizer_Recognized;

            // Start continuous recognition and return; the recognizer keeps listening
            // in the background until Close() stops it.
            await recognizer.StartContinuousRecognitionAsync().ConfigureAwait(false);
        }
    
        private void Recognizer_Recognized(object sender, SpeechRecognitionEventArgs e)
        {
            Console.WriteLine($"Recognized: {e.Result.Text}");
        }
    
        public async Task Talk(string text)
        {
            // Pause recognition so the synthesizer's output isn't transcribed as input.
            if (recognizer != null)
            {
                await recognizer.StopContinuousRecognitionAsync().ConfigureAwait(false);
            }

            Config.SpeechSynthesisVoiceName = "en-AU-CarlyNeural";

            using var synthesizer = new SpeechSynthesizer(Config);
            using var result = await synthesizer.SpeakTextAsync(text);

            if (result.Reason == ResultReason.Canceled)
            {
                var cancellation = SpeechSynthesisCancellationDetails.FromResult(result);
                Console.WriteLine($"Speech synthesis canceled: {cancellation.Reason}");
            }

            // Resume listening once the utterance has finished playing.
            if (recognizer != null)
            {
                await recognizer.StartContinuousRecognitionAsync().ConfigureAwait(false);
            }
        }
    
        public async Task Close()
        {
            if (recognizer != null)
            {
                await recognizer.StopContinuousRecognitionAsync().ConfigureAwait(false);
                recognizer.Dispose();
                recognizer = null;
            }
            audioInput?.Dispose();
        }
    }
    
    public class Program
    {
        public static async Task Main(string[] args)
        {
            string subscriptionKey = "<speech_key>";
            string serviceRegion = "<speech_region>";
    
            var speechService = new SpeechService(subscriptionKey, serviceRegion);
            await speechService.Listen();
    
            while (true)
            {
                Console.Write("Enter text to speak (or 'exit' to quit): ");
                string input = Console.ReadLine();
    
                if (input.ToLower() == "exit")
                {
                    break;
                }
                await speechService.Talk(input);
            }
    
            await speechService.Close();
        }
    }
    

    Output:

    It ran well, and when I spoke some lines, it gave me the text output below.

    [screenshot: console window showing the recognized text for each spoken line]