According to the Whisper README, transcription works as follows:
Internally, the transcribe() method reads the entire file and processes the audio with a sliding 30-second window, performing autoregressive sequence-to-sequence predictions on each window.
The documentation mentions that only the first 30-second window is considered for further analysis, and therefore for language detection. However, what if I want language detection to be based only on the last 30-second window? What would be a possible solution?
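For reference, this is the out-of-the-box behaviour (a minimal sketch; the file name "audio.mp3" and the "base" model size are placeholders):

import whisper

model = whisper.load_model("base")

# transcribe() detects the language from the first 30-second window of the file
result = model.transcribe("audio.mp3")
print(result["language"])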
The answer is simple: to detect the language from the last 30 seconds of a file (rather than the first), you can do the following (the file name and model size below are placeholders):
import whisper

model = whisper.load_model("base")

# load the audio, keep only its last 30 seconds (16 kHz * 30 s = 480,000 samples),
# and zero-pad the chunk in case the file is shorter than that
audio = whisper.pad_or_trim(whisper.load_audio("audio.mp3")[-480000:])

# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")