Tags: audio, signal-processing, theory, midi, soundfont

Is it possible to, as accurately as possible, decompose audio into MIDI, given the SoundFont that was used?


If I know the SoundFont that was used to render a MIDI file to audio, can I theoretically reverse the audio back into its (most likely) MIDI components? If so, what would be one of the best approaches to doing this?

The end goal is to try encoding audio (even voice samples) into MIDI so that I can reproduce the original audio in MIDI format better than, say, BearFileConverter does, and hopefully with better results than plain bandpass filters or FFT peak detection.

And no, this is not for any lossy audio compression or sheet-music transcription; it is mostly out of curiosity.


Solution

  • For monophonic music only, with no background sound, and only if your SoundFont synthesis engine and your recording sample rates are exactly matched (synchronized to 1 ppm or better), with no added effects, and both using a known A440 reference frequency, known intonation, etc., you can try cross-correlating your recorded audio against a set of waveform templates synthesized at each MIDI pitch from your a-priori known font. That gives a timeline of statistical likelihoods for each MIDI note. Find the local maxima across your pitch range, threshold, and peak-pick to estimate the most likely MIDI note onset times (a rough code sketch of this idea follows the answer below).

    Another possibility is sliding sound fingerprinting, but at an even higher computational cost.

    This fails in real life due to imperfectly matched sample rates plus added noise, speaker and room acoustic effects, multi-path reverb, and so on. You might also get false positives for notes whose waveforms closely resemble those of their own overtones. Voice samples deviate even further from any template.

    Forget bandpass filters or looking for FFT magnitude peaks, as those work reliably only for nearly pure sine waves, which very few musical instruments or interesting fonts sound like (or are as boring as).
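
A minimal sketch of the cross-correlation idea under the idealized assumptions above. It assumes you have already rendered each candidate MIDI pitch from the known SoundFont into a NumPy array at the same sample rate as the recording (e.g. via FluidSynth or a similar renderer; the `templates` dict below is such an externally prepared input, not part of any library API). The normalization is deliberately rough; a proper normalized cross-correlation would also divide by the local energy of the recording.

```python
import numpy as np
from scipy.signal import correlate, find_peaks

def detect_notes(audio, sample_rate, templates, threshold=0.6):
    """Return a list of (midi_pitch, onset_sample, score) tuples.

    audio      -- 1-D float array, the recording to analyse
    templates  -- dict mapping MIDI pitch -> 1-D float array rendered
                  from the same SoundFont at the same sample rate
    threshold  -- minimum correlation score to accept a note
    """
    detections = []
    audio = audio / (np.max(np.abs(audio)) + 1e-12)
    for pitch, tmpl in templates.items():
        # Unit-normalize the template so scores are comparable across pitches.
        tmpl = tmpl / (np.linalg.norm(tmpl) + 1e-12)
        # High correlation values mark positions where the recording
        # locally resembles this pitch's synthesized waveform.
        corr = correlate(audio, tmpl, mode="valid")
        corr /= np.sqrt(len(tmpl))  # rough energy normalization
        # Peak-pick: local maxima above the threshold, at least 50 ms apart.
        peaks, props = find_peaks(corr, height=threshold,
                                  distance=int(0.05 * sample_rate))
        for onset, score in zip(peaks, props["peak_heights"]):
            detections.append((pitch, onset, float(score)))
    # Sort by onset time to get a rough MIDI-like event list.
    return sorted(detections, key=lambda d: d[1])
```

The per-pitch score threshold and the minimum peak spacing are the two knobs you would tune; with mismatched sample rates, reverb, or polyphony the scores collapse toward noise, which is exactly the failure mode described above.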