Difference between output of python librosa.core.stft() and matlab spectrogram(x)

I am converting Python code to MATLAB. The Python code uses the following command:

stft_ch = librosa.core.stft(audio_input[:, ch_cnt], n_fft=self._nfft, 
                            hop_length=self._hop_len, win_length=self._win_len, 
                            window='hann')

where audio_input.shape=(2880000, 4), self._nfft=2048, self._hop_len=960, and self._win_len=1920.

When converting to MATLAB I used:

stft_ch = spectrogram(audio_input(:, ch_cnt), hann(win_len), win_len-hop_len, nfft);

where I verified size(audio_input)=2880000, 4, win_len=1920, win_len-hop_len=960 and nfft=2048.

I am getting an output from MATLAB with size(stft_ch)=1025, 2999, whereas Python gives stft_ch.shape=(1025, 3001). The size 2999 in the MATLAB output is clear and fits the documentation, where k = ⌊(Nx – noverlap)/(length(window) – noverlap)⌋ if window is a vector.
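As a quick sanity check, the documented formula can be evaluated with the values above. This is only a sketch in Python; the helper name matlab_num_frames is made up for illustration and is not part of MATLAB or librosa.

```python
def matlab_num_frames(nx, win_len, noverlap):
    """Column count per the documented formula
    k = floor((Nx - noverlap) / (length(window) - noverlap)).
    Hypothetical helper, for illustration only."""
    return (nx - noverlap) // (win_len - noverlap)

# Values from the question: win_len=1920, hop_len=960, so noverlap=960.
k = matlab_num_frames(nx=2880000, win_len=1920, noverlap=1920 - 960)
print(k)  # 2999
```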

However, I could not find in the Python documentation how the length of t is set.

Why is there a difference between the sizes? Is my conversion correct?

Is there a Python function which produces an output more similar to MATLAB's spectrogram() so that I can get the complex output with the same size?

Solution

I have found the answer myself.

The MATLAB function spectrogram() outputs a vector of times corresponding to the middle of each window, and it omits a final window that would not fit entirely inside the signal. For example, a 10-sample signal with a 3-sample window and a 1-sample overlap results in the following 4 windows:

1:3, 3:5, 5:7, 7:9, where m:n denotes a window covering samples m through n inclusive.

The centers for the windows would, therefore, be: 2,4,6,8. Note that the 10th sample is not included.

It seems that MATLAB uses the maximal number_of_windows subject to (number_of_windows-1)*hop_length+window_size<=number_of_samples.
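The toy example can be reproduced by listing every window that fits completely inside the signal. This is a sketch of the rule as described here, not MATLAB's actual implementation; full_windows is a hypothetical helper and uses 1-based sample indices to match the text.

```python
def full_windows(n_samples, win_len, hop):
    """Return (first, last) 1-based inclusive indices of every window
    that fits entirely inside the signal. Hypothetical helper."""
    windows = []
    start = 1
    while start + win_len - 1 <= n_samples:
        windows.append((start, start + win_len - 1))
        start += hop
    return windows

# 3-sample window with 1-sample overlap means a hop of 2 samples.
wins = full_windows(n_samples=10, win_len=3, hop=2)
centers = [(first + last) // 2 for first, last in wins]
print(wins)     # [(1, 3), (3, 5), (5, 7), (7, 9)]
print(centers)  # [2, 4, 6, 8]
```

With the values from the question (2880000 samples, window 1920, hop 960), the same loop yields 2999 windows.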

In the Python version, librosa.core.stft(), on the other hand, t is the time of the first sample of each frame, and the frames cover more than the input signal. For example, a 10-sample signal with a 3-sample window and 2-sample hops (hops, not overlap) results in the following 5 windows:

1:3, 3:5, 5:7, 7:9, 9:11, where m:n denotes a window covering samples m through n inclusive.

The beginnings of the windows would, therefore, be: 1, 3, 5, 7, 9. Note that the nonexistent 11th sample is included.

It seems that librosa uses the minimal number_of_windows subject to number_of_windows*hop_length>number_of_samples.

In my case:

(2999-1)*960+1920 = 2880000 <= 2880000 for MATLAB. In Python, 3001*960 = 2880960 > 2880000, whereas 3000*960 = 2880000 is not greater than 2880000.
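The arithmetic above can be checked in a few lines. This sketch only encodes the two rules as stated in this answer; it does not call MATLAB or librosa, and the variable names are illustrative.

```python
n_samples, hop, win_len = 2880000, 960, 1920

# MATLAB rule: largest n with (n - 1) * hop + win_len <= n_samples
n_matlab = (n_samples - win_len) // hop + 1

# librosa rule as stated above: smallest n with n * hop > n_samples
n_librosa = n_samples // hop + 1

print(n_matlab, n_librosa)  # 2999 3001
```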

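Note that the framing in Python is controlled by the center flag. With center=True (the default in librosa), the signal is padded so that frame t is centered on sample t*hop_length, which is where the frames extending past the end of the input come from. With center=False, frame t begins at sample t*hop_length and no padding is added.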
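The effect of the center flag on the number of frames can be sketched with plain arithmetic. This assumes that librosa pads n_fft // 2 samples on each side when center=True and then slices frames of length n_fft; librosa_num_frames is a hypothetical helper that mirrors that arithmetic and does not call librosa itself.

```python
def librosa_num_frames(n_samples, n_fft, hop, center=True):
    """Frame count assuming frames of length n_fft and, when center=True,
    n_fft // 2 samples of padding on each side. Hypothetical helper."""
    pad = n_fft // 2 if center else 0
    padded = n_samples + 2 * pad
    return 1 + (padded - n_fft) // hop

print(librosa_num_frames(2880000, 2048, 960, center=True))   # 3001
print(librosa_num_frames(2880000, 2048, 960, center=False))  # 2998
print(librosa_num_frames(10, 3, 2, center=True))             # 5
```

Under these assumptions, center=False gives 2998 columns rather than MATLAB's 2999, because the frames are n_fft=2048 samples long rather than win_len=1920, so switching the flag alone would not make the two outputs the same size.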

This is the best explanation I could find.
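As for getting a complex output of the same size in Python, one option is to do the framing by hand. The following is a minimal NumPy sketch, not a verified drop-in for spectrogram(): matlab_like_stft is a hypothetical helper, it assumes a real-valued input and a symmetric Hann window (np.hanning), and its values have not been compared against MATLAB output. The example uses a short random signal rather than the real audio.

```python
import numpy as np

def matlab_like_stft(x, win_len, hop, nfft):
    """One-sided STFT over full, unpadded frames of win_len samples,
    each windowed and zero-padded to nfft. Hypothetical helper."""
    noverlap = win_len - hop
    n_frames = (len(x) - noverlap) // hop  # same formula as k above
    window = np.hanning(win_len)           # symmetric Hann window
    frames = np.stack([x[i * hop : i * hop + win_len] for i in range(n_frames)])
    # rfft zero-pads each windowed frame to nfft and keeps nfft // 2 + 1 bins
    return np.fft.rfft(frames * window, n=nfft, axis=1).T

# Short stand-in signal; the real input has 2880000 samples per channel.
x = np.random.default_rng(0).standard_normal(28800)
S = matlab_like_stft(x, win_len=1920, hop=960, nfft=2048)
print(S.shape)  # (1025, 29)
```

Applied to one 2880000-sample channel with the same parameters, the frame count formula gives 2999 columns, matching the MATLAB size.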