Is there a NAO simulation with Speech Recognition?

Due to Covid-19, I don't have access to a physical NAO and need to work with simulations. The goal is to model dialogues of different complexity, also involving gestures. Speech recognition is the most important feature here, but simulation of other features that add more realism (like voice) would be appreciated too.

I am working from a Mac (with Catalina).

What I've tried:

Choregraphe: The included simulation works fine, but is very restricted in its abilities. Unless I'm missing something, dialogues are only simulated as a written chat: I type the speech input and get 'speech bubbles' as a response.
Webots for NAO: No longer supported?
Webots (using Python controllers): The most promising approach so far, but there is basically no documentation on how to write NAO controllers. I could not figure out how to make the Speaker() class work. The robot and world simulation from naoqisim (which is also no longer maintained) seem to run fine.
Webots using ROS controller: There is no official support for Mac, and the recommended installation for ROS Kinetic has not yet worked for me.

I'd appreciate any hint on whether Webots is even suitable for dialogues (seems to be mostly focussed on movement) or advice for other suitable simulations.

Solution

Choregraphe

Unfortunately, the ALTextToSpeech and ALSpeechRecognition APIs don't work on the virtual robot. From the docs here

ACAPELA, microAITalk and Nuance engines are only available on the real robot. When using a virtual robot, said text can be visualized in Choregraphe Robot View and Dialog panel.

and here

[Speech Recognition] cannot be tested on a simulated robot - This module is only available on a real robot, you cannot test it on a simulated robot.

The text interaction can be used to test the flow of your dialogs, but it won't let you properly test the nuances of speech recognition.

Other Simulators

Webots for NAO is not supported any more, and I've never had any luck getting it set up. The best currently available simulation environment for Pepper/NAO is the ROS Gazebo stack, but it isn't really designed for audio simulation either. It would let you simulate the robot making gestures and moving through the world, but you would have to write your own custom code (ROS nodes, in Python or C++) to process the audio, do speech recognition, and output speech (for example, connected to your own mic and speakers).

If you plan to use a NAOqi QiChat chatbot, you could use the NAOqi Python APIs to run it and connect external speech-to-text and text-to-speech services to it. If you want more complex speech interactions, though, I'd suggest a full-blown chatbot (Dialogflow, IBM Watson, etc.).
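To make the "connect external services" idea a bit more concrete, here is a minimal sketch of such a loop in plain Python. It is only an illustration, not NAOqi code: `run_dialogue`, `respond` and `speak` are hypothetical names, the keyword rules stand in for whatever dialogue engine you end up using (QiChat, Dialogflow, ...), and `listen` is whatever speech-to-text function you plug in. For output it assumes the `say` command that ships with macOS and falls back to printing if that is not available.

```python
import shutil
import subprocess


def speak(text):
    """Speak text with the macOS `say` command, or print it as a fallback."""
    if shutil.which("say"):
        subprocess.run(["say", text], check=False)
    else:
        print("ROBOT: " + text)


def respond(utterance, rules):
    """Hypothetical stand-in for the dialogue engine: first matching keyword wins."""
    lowered = utterance.lower()
    for keyword, reply in rules:
        if keyword in lowered:
            return reply
    return "Sorry, I did not understand that."


def run_dialogue(listen, respond_fn, speak_fn, stop_word="goodbye"):
    """Repeat listen -> respond -> speak until the stop word or no more input.

    listen:     callable returning recognized text, or None when input ends
    respond_fn: callable mapping recognized text to a reply string
    speak_fn:   callable that outputs the reply (speaker, robot, log, ...)
    """
    transcript = []
    while True:
        heard = listen()
        if heard is None:
            break
        reply = respond_fn(heard)
        speak_fn(reply)
        transcript.append((heard, reply))
        if stop_word in heard.lower():
            break
    return transcript


if __name__ == "__main__":
    rules = [("hello", "Hello, nice to meet you."), ("goodbye", "See you later.")]
    # Typed input stands in for a real speech-to-text service here.
    run_dialogue(lambda: input("YOU: "), lambda u: respond(u, rules), speak)
```

Because the three stages are passed in as functions, you can start with typed input and later swap `listen` for a real recognizer, or swap `speak_fn` for something that also triggers a gesture in the simulator, without touching the loop itself.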