python scipy dct

Understanding the output of a DCT


I have some trouble understanding the output of the Discrete Cosine Transform. Background: I want to achieve a simple audio compression by saving only the most relevant frequencies of a DCT. In order to be somewhat general, I would cut several audio tracks into pieces of a fixed size, say 5 seconds. Then I would do a DCT on each piece and find out which are the most important frequencies among all short snippets.
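
The approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses `scipy.fft.dct`/`idct` with `norm="ortho"`, a synthetic two-tone signal instead of real audio, and the helper names (`split_into_chunks`, `most_relevant_bins`, `compress_chunk`) are made up for this example. "Most relevant" is taken here to mean the largest mean coefficient magnitude across chunks, which is one possible reading.

```python
import numpy as np
from scipy.fft import dct, idct


def split_into_chunks(signal, chunk_len):
    """Cut a 1-D signal into equal pieces, dropping a trailing partial piece."""
    n_chunks = len(signal) // chunk_len
    return signal[: n_chunks * chunk_len].reshape(n_chunks, chunk_len)


def most_relevant_bins(chunks, keep):
    """Indices of the `keep` DCT bins with the largest mean magnitude."""
    coeffs = dct(chunks, type=2, norm="ortho", axis=1)
    mean_magnitude = np.abs(coeffs).mean(axis=0)
    return np.sort(np.argsort(mean_magnitude)[-keep:])


def compress_chunk(chunk, bins):
    """Zero every DCT coefficient except `bins`, then transform back."""
    coeffs = dct(chunk, type=2, norm="ortho")
    kept = np.zeros_like(coeffs)
    kept[bins] = coeffs[bins]
    return idct(kept, type=2, norm="ortho")


# Synthetic stand-in for audio: two tones, 2 s at an assumed 8 kHz sample rate.
sample_rate = 8000
chunk_len = 2000  # 0.25 s per chunk
n = np.arange(2 * sample_rate)
signal = (np.cos(2 * np.pi * 440 * n / sample_rate)
          + 0.5 * np.cos(2 * np.pi * 1000 * n / sample_rate))

chunks = split_into_chunks(signal, chunk_len)
bins = most_relevant_bins(chunks, keep=2)
approx = compress_chunk(chunks[0], bins)

rel_error = np.linalg.norm(chunks[0] - approx) / np.linalg.norm(chunks[0])
print("kept bins:", bins.tolist(), "relative error:", rel_error)
```

Bin indices are only comparable between chunks of the same length, which is why every chunk here has exactly `chunk_len` samples.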

This, however, does not work, which might be due to my misunderstanding of the DCT. See for example the images below:

[Image 1: DCT of the first 40s of an audio track]
[Image 2: DCT of the first 10s of an audio track]
[Image 3: DCT of the first 40s flipped and concatenated to itself (abc->abccba)]

The first image shows the DCT of the first 40 seconds of an audio track (I wanted to make it long enough to get a good mix of frequencies). The second image shows the DCT of the first ten seconds. The third image shows the DCT of a reverse concatenation (like abc->abccba) of the first 40 seconds. I added a vertical mark at 2e5 for comparison. The sample rate of the music is the usual 44.1 kHz.

So here are my questions:

  1. What is the frequency that corresponds to an individual value of the DCT-output-vector? Is it bin/2? Like if I have a spike at bin=10000, which frequency in the real world does this correspond to?

  2. Why does the first plot show strong amplitudes for so many more frequencies than the second? My intuition was that the DCT would yield values for all frequencies up to 44.1 kHz (so bin number 88.2k if my assumption in #1 is correct), only that the scale of the spikes would be different, which would then make up the difference in the music.

  3. Why does the third plot show strong amplitudes for more frequencies than the first does? I thought that by concatenating the data, I would not get any new frequencies.

As DCT and FFT/DFT are very similar, I tried to learn more about the Fourier transform (this and this helped), but apparently it didn't suffice.


Solution

  • Figured it out myself, and it was indeed written in the link I posted in the question. For a DFT, the frequency that corresponds to a certain bin_id is (bin_id * freq/2) / (N/2), which boils down to bin_id * 1/t with N = freq*t. For the DCT the N bins only span 0 to freq/2, so the spacing is half of that: bin_id * freq / (2N) = bin_id / (2t). Either way the bin width is proportional to 1/t, which means the plots just have different granularities. So if plot #1 has a peak at position x, plot #2 will likely show a peak at x/4 and plot #3 at x*2.

    The image below shows the data of plot #1 stretched to twice its width (in blue) and the data of plot #3 (in yellow).

    [Image: plot #1 stretched to twice its width (blue) overlaid with plot #3 (yellow)]
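
To check the granularity argument numerically, here is a small sketch. It assumes scipy's default DCT-II (`scipy.fft.dct` with `type=2`), for which bin k of an N-sample signal lines up with k * sample_rate / (2 * N) Hz. The 8 kHz sample rate, the 440 Hz test tone and the helper names (`dct_bin_to_hz`, `peak_bin`, `tone`) are arbitrary choices for illustration, not taken from the original plots.

```python
import numpy as np
from scipy.fft import dct


def dct_bin_to_hz(bin_id, n_samples, sample_rate):
    """Frequency in Hz that DCT-II bin `bin_id` corresponds to."""
    return bin_id * sample_rate / (2.0 * n_samples)


def peak_bin(signal):
    """Index of the DCT-II coefficient with the largest magnitude."""
    return int(np.argmax(np.abs(dct(signal, type=2))))


sample_rate = 8000   # assumed sample rate for the demo
tone_hz = 440.0      # assumed test frequency


def tone(seconds):
    """Pure cosine tone of the given duration."""
    n = np.arange(int(seconds * sample_rate))
    return np.cos(2 * np.pi * tone_hz * n / sample_rate)


short = tone(1.0)                                     # like the 10 s plot
long_sig = tone(4.0)                                  # like the 40 s plot
mirrored = np.concatenate([long_sig, long_sig[::-1]])  # abc -> abccba

for name, sig in [("short", short), ("long", long_sig), ("mirrored", mirrored)]:
    k = peak_bin(sig)
    print(name, "peak bin:", k, "->", dct_bin_to_hz(k, len(sig), sample_rate), "Hz")

# For the mirrored signal, the even bins are exactly twice the DCT of the
# original and the odd bins vanish, i.e. the original plot stretched by two.
mirrored_dct = dct(mirrored, type=2)
print(np.allclose(mirrored_dct[::2], 2 * dct(long_sig, type=2)))
```

The peak bin grows with the signal length (4x for the longer signal, 8x for the mirrored one), while the converted frequency stays at the tone's 440 Hz. The last check also reflects why stretching plot #1 to twice its width lines up with plot #3.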