Tags: python, audio, librosa, audio-processing

How to get complete fundamental frequency (f0) extraction with the Python library librosa.pyin?


I am running librosa.pyin on a speech audio clip, and it doesn't seem to be extracting the fundamental frequency (f0) from the first part of the recording.

librosa documentation: https://librosa.org/doc/main/generated/librosa.pyin.html

import librosa

# load the clip (audio file linked below)
sr = 22050
y, sr = librosa.load('quick_fox.wav', sr=sr)

fmin = librosa.note_to_hz('C0')
fmax = librosa.note_to_hz('C7')

f0, voiced_flag, voiced_probs = librosa.pyin(y,
                                             fmin=fmin,
                                             fmax=fmax,
                                             pad_mode='constant',
                                             n_thresholds=10,
                                             max_transition_rate=100,
                                             sr=sr)
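As a quick check of how much of the clip pyin actually marks as voiced (a minimal sketch reusing the outputs above; the one-second cutoff is just an illustrative guess at "the first part"):

import numpy as np

f0_times = librosa.times_like(f0, sr=sr)
print(f"voiced frames overall: {np.mean(voiced_flag):.1%}")

# focus on the first second, where f0 appears to be missing
first = f0_times < 1.0
print(f"voiced frames in the first second: {np.mean(voiced_flag[first]):.1%}")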

Raw audio:

[figure: raw audio waveform]

Spectrogram with fundamental tones, onsets, and onset strength; note that the first part doesn't have any fundamental tones extracted.

link to audio file: https://jasonmhead.com/wp-content/uploads/2022/12/quick_fox.wav

o_env = librosa.onset.onset_strength(y=y, sr=sr)
times = librosa.times_like(o_env, sr=sr)
onset_frames = librosa.onset.onset_detect(onset_envelope=o_env, sr=sr)

[figure: spectrogram with extracted fundamental tones, onsets, and onset strength]
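For reference, an overlay plot like the one above can be reproduced roughly as follows (a sketch: the styling is assumed, and it reuses y, sr, f0, fmin, fmax, times, and onset_frames from the snippets above):

import numpy as np
import matplotlib.pyplot as plt
import librosa.display

# log-power spectrogram as the background
D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)

fig, ax = plt.subplots()
librosa.display.specshow(D, sr=sr, x_axis='time', y_axis='log', ax=ax)

# f0 track (NaN where pyin marks the frame unvoiced) and onset markers
ax.plot(librosa.times_like(f0, sr=sr), f0, color='cyan', linewidth=2, label='f0')
ax.vlines(times[onset_frames], fmin, fmax, color='white', alpha=0.6, label='onsets')
ax.legend(loc='upper right')
plt.show()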

Another view with power spectrogram:

[figure: power spectrogram view]

I tried compressing the audio, but that didn't seem to work.

Any suggestions on which parameters I can adjust, or what audio pre-processing can be done, so that fundamental tones are extracted for all words?

What kinds of things affect the success of fundamental tone extraction?


Solution

  • TL;DR: It seems like it's all about parameter tweaking.

    Here are some results I got while playing with the example (better opened in a separate tab):

    [figure: comparison graphs]

    The bottom plot shows a phonetic transcription (well, kinda) of the example file. Some conclusions I've made for myself:

    1. There are some words/parts of words that are difficult to hear: they have low energy, and when listened to alone they don't sound like words, only when coupled with nearby segments ("the" is very short and sounds more like "z").
    2. Some words are divided into parts (e.g. "fo"-"x").
    3. I don't really know what the F0 frequency should be when someone pronounces "x". I'm not even sure there is any difference in pronunciation between people (otherwise, how would cats all over the world know we are calling them).
    4. Two seconds is a pretty short amount of time.

    Some experiments:

    • If we want to see a smooth F0 graph, n_thresholds=1 will do it, but it's a bad idea: in the voiced_flag part of the graphs, we see that with n_thresholds=1 every frame is declared voiced, counting every frequency change as activity.
    • Changing the sample rate affects the ability to retrieve F0 (in the rightmost graph, the sample rate was halved). As mentioned, n_thresholds=1 doesn't count, but we also see that n_thresholds=100 (the default value for pyin) doesn't produce any F0 at all.
    • The top-left (max_transition_rate=200) and middle (max_transition_rate=100) graphs show the extracted F0 for n_thresholds=2 and n_thresholds=100. It actually degrades pretty fast: n_thresholds=3 already looks almost the same as n_thresholds=100. I find the lower part, the voiced_flag decision plot, especially useful when combined with the phonetic transcript. In the middle graph, the default parameters recognise "qui", "jum", "over", "la". If we want F0 for the other phonemes, n_thresholds=2 should do the work.
    • Setting n_thresholds=3 or higher gives F0s in the same range. Increasing max_transition_rate adds noise and a reluctance to declare that a voiced segment is over. A sketch of this kind of parameter sweep follows the list.
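    A minimal sketch of that kind of sweep (my reconstruction, not the script behind the graphs above; the parameter grid and file path are assumptions):

    import numpy as np
    import librosa

    y, sr = librosa.load('quick_fox.wav', sr=22050)
    fmin, fmax = librosa.note_to_hz('C0'), librosa.note_to_hz('C7')

    # compare voiced-frame coverage across parameter combinations
    for n_thresholds in (1, 2, 3, 100):
        for max_transition_rate in (100, 200):
            f0, voiced_flag, _ = librosa.pyin(
                y, fmin=fmin, fmax=fmax, sr=sr,
                n_thresholds=n_thresholds,
                max_transition_rate=max_transition_rate)
            print(f"n_thresholds={n_thresholds:>3}, "
                  f"max_transition_rate={max_transition_rate}: "
                  f"voiced {np.mean(voiced_flag):.1%}")

    # the halved-sample-rate experiment (rightmost graph)
    y_half = librosa.resample(y, orig_sr=sr, target_sr=sr // 2)
    f0, voiced_flag, _ = librosa.pyin(y_half, fmin=fmin, fmax=fmax, sr=sr // 2)
    print(f"sr={sr // 2}: voiced {np.mean(voiced_flag):.1%}")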

    Those are my thoughts. Hope it helps.