I have the following code that stores raw audio data from a WAV file in a byte buffer:
BYTE header[74];
fread(&header, sizeof(BYTE), 74, inputFile);
BYTE * sound_buffer;
DWORD data_size;
fread(&data_size, sizeof(DWORD), 1, inputFile);
sound_buffer = (BYTE *)malloc(sizeof(BYTE) * data_size);
fread(sound_buffer, sizeof(BYTE), data_size, inputFile);
Is there any algorithm to determine when the audio track is silent (literally no sound) and when there is some sound level?
Well, your "sound" will be an array of values, whether integer or real, depending on your format.
For the file to be silent, or to "have no sound", the values in that array have to be zero or very close to zero; in the worst case, if the audio has a DC bias, the value will stay constant instead of fluctuating around it to produce sound waves.
You can write a simple function that returns the delta for a range, in other words the difference between the largest and smallest value; the lower the delta, the lower the sound volume.
Alternatively, you can write a function that returns the ranges in which the delta stays below a given threshold.
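As a minimal standalone sketch of that delta idea (assuming 16-bit signed samples; `sampleDelta` is just an illustrative name, not part of any library):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Difference between the largest and smallest sample in [data, data + window).
// A small delta means the signal barely moves, i.e. it is (near) silent.
// Returned as int so a full-range swing doesn't overflow int16_t.
int sampleDelta(const int16_t * data, std::size_t window) {
    auto mm = std::minmax_element(data, data + window);
    return int(*mm.second) - int(*mm.first);
}
```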
For the sake of toying, I wrote a nifty class:
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

using uint = unsigned int;

template<typename T>
class SilenceFinder {
public:
    // data: pointer to the samples, size: number of samples,
    // samples: sample rate (used to convert positions into seconds)
    SilenceFinder(T * data, uint size, uint samples)
        : d(data), sBegin(0), s(size), samp(samples), status(Undefined) {}

    std::vector<std::pair<uint, uint>> find(const T threshold, const uint window) {
        auto r = findSilence(d, s, threshold, window);
        regionsToTime(r);
        return r;
    }

private:
    enum Status {
        Silent, Loud, Undefined
    };

    void toggleSilence(Status st, uint pos, std::vector<std::pair<uint, uint>> & res) {
        if (st == Silent) {
            if (status != Silent) sBegin = pos; // a silent region starts here
            status = Silent;
        }
        else {
            if (status == Silent) res.push_back(std::make_pair(sBegin, pos));
            status = Loud;
        }
    }

    void end(Status st, uint pos, std::vector<std::pair<uint, uint>> & res) {
        // close a silent region that runs all the way to the end of the data
        if ((status == Silent) && (st == Silent)) res.push_back(std::make_pair(sBegin, pos));
    }

    static T delta(const T * data, const uint window) {
        // note lowest(), not min(): for floating-point types min() is the
        // smallest positive value, not the most negative one
        T min = std::numeric_limits<T>::max(), max = std::numeric_limits<T>::lowest();
        for (uint i = 0; i < window; ++i) {
            T c = data[i];
            if (c < min) min = c;
            if (c > max) max = c;
        }
        return max - min;
    }

    std::vector<std::pair<uint, uint>> findSilence(const T * data, const uint size, const T threshold, const uint window) {
        std::vector<std::pair<uint, uint>> regions;
        uint pos = 0;
        Status s = Undefined;
        while ((pos + window) <= size) {
            s = (delta(data + pos, window) < threshold) ? Silent : Loud;
            toggleSilence(s, pos, regions);
            pos += window;
        }
        // handle the tail shorter than a full window, guarding against an
        // empty range
        if (pos < size)
            s = (delta(data + pos, size - pos) < threshold) ? Silent : Loud;
        end(s, size, regions);
        return regions;
    }

    void regionsToTime(std::vector<std::pair<uint, uint>> & regions) {
        // convert sample positions to seconds
        for (auto & r : regions) {
            r.first /= samp;
            r.second /= samp;
        }
    }

    T * d;
    uint sBegin, s, samp;
    Status status;
};
I haven't really tested it, but it looks like it should work. Note, however, that it assumes a single audio channel; you will have to extend it to work with (and across) multichannel audio. Here is how you use it:
SilenceFinder<audioDataType> finder(audioDataPtr, sizeOfData, sampleRate);
auto res = finder.find(threshold, scanWindow);
// and output the silent regions
for (auto r : res) std::cout << r.first << " " << r.second << std::endl;
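Since the question reads the file into a raw BYTE buffer, you first need to view those bytes as typed samples. A sketch, assuming the data is 16-bit signed PCM on a little-endian host (which matches the WAV byte order); `bytesToSamples` is just an illustrative helper:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Copy the raw bytes into a vector of int16_t samples. memcpy sidesteps
// alignment and strict-aliasing issues that a pointer cast would invite.
std::vector<int16_t> bytesToSamples(const unsigned char * raw, std::size_t numBytes) {
    std::vector<int16_t> samples(numBytes / sizeof(int16_t));
    std::memcpy(samples.data(), raw, samples.size() * sizeof(int16_t));
    return samples;
}
```

You would then construct the finder with `samples.data()` and `samples.size()` instead of the raw buffer and byte count.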
Also notice that, as implemented right now, the "cut" to silent regions will be very abrupt. "Noise gate" type filters usually come with attack and release parameters that smooth out the result. For example, there might be 5 seconds of silence with just a tiny pop in the middle: without attack and release parameters you will get those 5 seconds split in two, and the pop will remain; with them you can tune how sensitive the gate is about when to cut off.
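A crude release-style smoothing can be approximated as a post-processing step that swallows loud bursts shorter than some minimum length. A sketch (assuming the region list produced above, in sample positions; `mergeShortBursts` is a hypothetical helper, not part of the class):

```cpp
#include <utility>
#include <vector>

// Merge silent regions separated by a loud burst shorter than minLoud
// samples, so a tiny pop no longer splits one long silence in two.
std::vector<std::pair<unsigned, unsigned>>
mergeShortBursts(const std::vector<std::pair<unsigned, unsigned>> & regions,
                 unsigned minLoud) {
    std::vector<std::pair<unsigned, unsigned>> out;
    for (const auto & r : regions) {
        if (!out.empty() && r.first - out.back().second < minLoud)
            out.back().second = r.second; // swallow the short burst
        else
            out.push_back(r);
    }
    return out;
}
```

A proper attack/release implementation would instead apply an envelope follower to the signal level, but for splitting on silence this kind of merging already goes a long way.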