According to the Whisper README, transcription works as follows:
Internally, the transcribe() method reads the entire file and processes the audio with a sliding 30-second window, performing autoregressive sequence-to-sequence predictions on each window.
The documentation mentions that only the first 30-second window is considered for further analysis, and therefore for language detection. However, what if I want language detection to be based only on the last 30-second window? What would be a possible solution?
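For reference, this is the out-of-the-box behaviour (a minimal sketch; the file name "audio.mp3" and the "base" model size are placeholders):

import whisper

model = whisper.load_model("base")

# transcribe() detects the language from the first 30-second window of the file
result = model.transcribe("audio.mp3")
print(result["language"])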
The answer is simple: to detect the language from the last 30 seconds of a file (rather than the first), you can do the following (the file name and model size below are placeholders):
import whisper

model = whisper.load_model("base")

# load the audio, keep only its last 30 seconds (16 kHz * 30 s = 480,000 samples),
# and zero-pad the chunk in case the file is shorter than that
audio = whisper.pad_or_trim(whisper.load_audio("audio.mp3")[-480000:])

# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")