Tags: python, nlp, openai-whisper

Make Whisper use the LAST 30 sec chunk (and not the first)


Whisper's documentation describes the transcription process as follows:

Internally, the transcribe() method reads the entire file and processes the audio with a sliding 30-second window, performing autoregressive sequence-to-sequence predictions on each window.

It is also mentioned that only the first 30-second window is used for language detection. However, what if I would like to base language detection on the last 30-second window instead? How could this be done?


Solution

  • The answer is simple: to detect the language from the last 30 seconds of a file (rather than the first), slice the audio array before computing the spectrogram. Whisper works at a fixed 16 kHz sample rate, so 30 seconds corresponds to 16,000 × 30 = 480,000 samples:

    import whisper
    
    model = whisper.load_model("base")
    audio = whisper.load_audio("audio.mp3")  # placeholder path
    
    # keep only the LAST 30 seconds (16 kHz * 30 s = 480,000 samples),
    # padding with silence if the file is shorter than that
    audio = whisper.pad_or_trim(audio[-480000:])
    
    # make log-Mel spectrogram and move to the same device as the model
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    
    # detect the spoken language
    _, probs = model.detect_language(mel)
    print(f"Detected language: {max(probs, key=probs.get)}")
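
The magic number 480,000 is just the window length times Whisper's sample rate. If you want to experiment with other trailing windows (say, the last 10 or 60 seconds), a tiny helper makes the arithmetic explicit; `last_seconds` is a hypothetical name, not part of the Whisper API:

    def last_seconds(samples, seconds, sample_rate=16000):
        """Return the trailing `seconds` of a 1-D sample array (or all of it if shorter)."""
        n = int(seconds * sample_rate)
        return samples[-n:] if n < len(samples) else samples
    
    # 45 seconds of dummy audio at 16 kHz
    audio = list(range(45 * 16000))
    tail = last_seconds(audio, 30)
    print(len(tail))  # 480000, i.e. the last 30 seconds

You would then pass the result through `whisper.pad_or_trim` and `whisper.log_mel_spectrogram` exactly as in the snippet above.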