I've been working on a part of my app for the past few days where I need to simultaneously play and record an audio file. The task I need to accomplish is just to compare the recording to the audio file played and return a matching percentage. Here's what I have done so far and some context to my questions:
And below are a few questions that I have:
- When I record the audio using AudioRecord, is the format PCM by default or do I need to specify this somehow?
- I'm trying to pass the recording to the FFT class in order to acquire the frequency domain data to perform my matching analysis. Is there a way to do this without saving the recording on the user's device?
- After performing the FFT analysis on both files, do I need to store the data in a text file in order to perform the matching analysis? What are some options or possible ways to do this?
- After doing a fair amount of research, all the sources that I found cover how to match a recording against songs/music stored in a database. My goal is to see how closely two specific audio files match. How would I go about this?
- Do I need to create/use hash functions in order to accomplish my goal? A detailed answer to this would be really helpful.
- Currently I have a separate thread for recording; separate activity for decoding the audio file; separate activity for the FFT analysis. I plan to run the matching analysis in a separate thread as well or an AsyncTask. Do you think this structure is optimal or is there a better way to do it? Also, should I pass my audio file to the decoder in a separate thread as well or can I do it in the recording thread or MatchingAnalysis thread?
- Do I need to perform windowing in my operations on audio files before I can do matching comparison?
- Do I need to decode the .wav file or can I just compare 2 .wav files directly instead?
- Do I need to perform low-pitching operations on audio files before comparison?
- In order to perform my matching comparison, what data exactly do I need to generate (power spectrum, energy spectrum, spectrogram etc)?
Am I going about this the right way or am I missing something?
In apps like Shazam and Midomi, audio matching is done using a technique called audio fingerprinting, which relies on a spectrogram and hashing.
- Your first step, computing the FFT, is correct, but you then need to build a 2D map of frequency against time, called a spectrogram.
- The spectrogram array can contain more than a million values, which is too much data to work with directly, so we find the peaks in amplitude. A peak is a (time, frequency) pair whose amplitude is the greatest in a local neighborhood around it. Peak finding is computationally expensive, and different apps and projects do it in different ways. Peaks are used because they are relatively insensitive to background noise.
- Different songs can share the same peaks, but the order in which the peaks occur and the time differences between them will differ. So we combine these peaks into unique hashes and save them in a database.
- Repeat the above process for each audio file you want your app to recognise, then match a recording's hashes against your database. Matching is not simple: the recording can start at any instant in the song, while the database holds the fingerprint of the full song, so time offsets must be taken into account. This is not a problem in practice, because the fingerprint contains the relative time differences between peaks.
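The spectrogram and peak-finding steps above can be sketched roughly as follows. This is only an illustration under some assumptions: the class and method names are made up, the input is mono PCM already converted to `double` samples, a Hann window is applied to each frame, and a naive DFT is used to keep the example self-contained (in a real app you would swap in a proper FFT routine).

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only: magnitude spectrogram plus local-maximum peak picking. */
final class SpectrogramPeaks {

    /** Returns mag[frame][bin] for bins 0..frameSize/2-1, using a Hann window and a naive DFT. */
    static double[][] spectrogram(double[] samples, int frameSize, int hop) {
        int frames = samples.length < frameSize ? 0 : 1 + (samples.length - frameSize) / hop;
        int bins = frameSize / 2;

        // Precompute the Hann window once (assumed window choice, not from the original answer).
        double[] window = new double[frameSize];
        for (int n = 0; n < frameSize; n++) {
            window[n] = 0.5 - 0.5 * Math.cos(2 * Math.PI * n / (frameSize - 1));
        }

        double[][] mag = new double[frames][bins];
        for (int f = 0; f < frames; f++) {
            int start = f * hop;
            for (int k = 0; k < bins; k++) {
                double re = 0, im = 0;
                for (int n = 0; n < frameSize; n++) {
                    double x = samples[start + n] * window[n];
                    double angle = 2 * Math.PI * k * n / frameSize;
                    re += x * Math.cos(angle);
                    im -= x * Math.sin(angle);
                }
                mag[f][k] = Math.hypot(re, im);
            }
        }
        return mag;
    }

    /** Returns {frame, bin} pairs that are at least minMagnitude and not exceeded by any neighbor within radius. */
    static List<int[]> findPeaks(double[][] mag, int radius, double minMagnitude) {
        List<int[]> peaks = new ArrayList<>();
        for (int t = 0; t < mag.length; t++) {
            for (int k = 0; k < mag[t].length; k++) {
                if (mag[t][k] >= minMagnitude && isLocalMax(mag, t, k, radius)) {
                    peaks.add(new int[] {t, k});
                }
            }
        }
        return peaks;
    }

    private static boolean isLocalMax(double[][] mag, int t, int k, int radius) {
        double v = mag[t][k];
        for (int dt = -radius; dt <= radius; dt++) {
            for (int dk = -radius; dk <= radius; dk++) {
                int tt = t + dt, kk = k + dk;
                if (dt == 0 && dk == 0) continue;
                if (tt < 0 || tt >= mag.length || kk < 0 || kk >= mag[tt].length) continue;
                if (mag[tt][kk] > v) return false; // a neighbor is louder, so this is not a peak
            }
        }
        return true;
    }
}
```

The `radius` and `minMagnitude` values control how many peaks survive; they are tuning knobs you would have to pick experimentally for your recordings.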
It is a fairly involved process, and you can find a more detailed explanation in this paper: http://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf
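To make the hashing and time-offset idea more concrete for the two-file case, here is a simplified sketch, not the exact scheme from the paper. Everything named here (`PeakPairMatcher`, `FAN_OUT`, `MAX_DT`, the bit layout of the hash, and the scoring formula) is an assumption chosen for illustration. It pairs each peak with a few later peaks, hashes (frequency bin 1, frequency bin 2, time gap), and then counts how many hashes from the recording line up with the reference at one common time offset.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only: peak-pair hashes and offset voting between two clips. */
final class PeakPairMatcher {
    static final int FAN_OUT = 5;   // assumed: max pairs formed per anchor peak
    static final int MAX_DT = 64;   // assumed: max frame gap between paired peaks

    /** A hash together with the frame at which its anchor peak occurred. */
    static final class Fingerprint {
        final long hash;
        final int frame;
        Fingerprint(long hash, int frame) { this.hash = hash; this.frame = frame; }
    }

    /** peaks holds {frame, bin} entries sorted by frame; bins are assumed to fit in 16 bits. */
    static List<Fingerprint> fingerprints(List<int[]> peaks) {
        List<Fingerprint> out = new ArrayList<>();
        for (int i = 0; i < peaks.size(); i++) {
            int[] a = peaks.get(i);
            int paired = 0;
            for (int j = i + 1; j < peaks.size() && paired < FAN_OUT; j++) {
                int[] b = peaks.get(j);
                int dt = b[0] - a[0];
                if (dt <= 0) continue;      // skip peaks in the same frame
                if (dt > MAX_DT) break;     // sorted input: later peaks are even further away
                long hash = ((long) a[1] << 32) | ((long) b[1] << 16) | dt;
                out.add(new Fingerprint(hash, a[0]));
                paired++;
            }
        }
        return out;
    }

    /** Fraction (0..1) of recording fingerprints that agree with the reference at the most common offset. */
    static double matchScore(List<Fingerprint> reference, List<Fingerprint> recording) {
        if (recording.isEmpty()) return 0.0;
        Map<Long, List<Integer>> index = new HashMap<>();
        for (Fingerprint f : reference) {
            index.computeIfAbsent(f.hash, h -> new ArrayList<>()).add(f.frame);
        }
        Map<Integer, Integer> offsetVotes = new HashMap<>();
        int best = 0;
        for (Fingerprint f : recording) {
            List<Integer> refFrames = index.get(f.hash);
            if (refFrames == null) continue;
            for (int refFrame : refFrames) {
                // Matching hashes from a true match share the same (reference - recording) offset.
                int votes = offsetVotes.merge(refFrame - f.frame, 1, Integer::sum);
                best = Math.max(best, votes);
            }
        }
        return Math.min(1.0, (double) best / recording.size());
    }
}
```

Multiplying the returned score by 100 gives one possible "matching percentage". Because everything stays in memory, nothing needs to be written to a text file or a database when only two clips are compared.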
There are some libraries that can do this for you: dejavu (https://github.com/worldveil/dejavu) and chromaprint (it's in C++). musicg (hosted on Google Code) is in Java, but it doesn't perform well with background noise.
Matching two audio files is a complicated process, and like the comments above, I would also advise you to try it on a PC first and then on phones.