I'm new to signal processing. I searched Google for spectrogram terminology, but I couldn't find anything that explains the differences between the types of spectrogram. Can anyone explain the definition and meaning of the different spectrograms in the picture below? Thanks!
P.S.: And what is the difference between a spectrogram and chroma? What is chroma used for, and when?
You asked to clarify two terms: Spectrogram and chroma. The short answer is:
A spectrogram is a visualization over time of all the frequencies that make up a given sound. Any sound (waveform) can be broken down into a weighted sum of pure sinusoids; for a periodic sound, their frequencies are integer multiples of a fundamental frequency (harmonics).
See the last section for an explanation of each type of spectrogram.
Chroma means the 12 equal degrees of an equal-tempered chromatic scale.
A ♯ raises the pitch of a note by a semitone, and the interval between any two adjacent degrees is equal to a semitone. However, the concept is more complex than that: every degree can also be lowered by a semitone using a flat (♭), so it appears that C♯ and D♭ have the same pitch.
While the pitch is the same from an audio engineer's standpoint, this equality of sharps and flats doesn't hold in music. For a composer, flats and sharps are not equivalent, do not have the same purpose, and one is never used in place of the other. On non-keyboard instruments, where the performer has to create the pitch themselves, C♯ and D♭ have different pitches, meaning the performer doesn't use the equal-tempered chromatic scale but some variant of the Pythagorean scale, where all notes, including sharps and flats, are derived from one another by successive 3:2 frequency ratios.
If you are interested in the details, you may continue reading, but you may have to read twice as the concepts behind are surprisingly counter-intuitive.
Spectrogram: Splitting a sound into harmonics
A spectrogram is a visualization of the frequency spectrum, a breakdown of the sound into pure sinusoids of different frequencies. It shows how the amplitude of each frequency varies over time. This can be shown on a 2D plot (alternatively a 3D plot) where x is used for time, y for frequency, and a color denotes the amplitude of each frequency component found in the sound:
Voice spectrogram, source
In these plots, axes can be linear or logarithmic, and the frequency axis can even show note names instead of actual frequencies, as each note corresponds to a frequency. When the axis is reduced to the 12 note names regardless of octave (pitch classes), the plot is called a chromagram. See the sections further below for details about plots used in audio analysis.
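To make the idea concrete, here is a minimal sketch of a spectrogram computed with a naive discrete Fourier transform in plain Python. The sample rate, the frame length and the test signal (a 500 Hz tone followed by a 1000 Hz tone) are assumptions chosen for illustration; real tools use an FFT with overlapping, windowed frames.

```python
import cmath
import math

SAMPLE_RATE = 8000   # Hz, assumed for this example
FRAME = 64           # samples per analysis frame (no overlap, no window)

def dft_magnitudes(frame):
    """Magnitude of each frequency bin from 0 Hz up to SAMPLE_RATE / 2."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(frame))) / n
        for k in range(n // 2 + 1)
    ]

def spectrogram(signal):
    """One list of bin magnitudes per frame: rows are time, columns are frequency."""
    return [dft_magnitudes(signal[s:s + FRAME])
            for s in range(0, len(signal) - FRAME + 1, FRAME)]

def tone(freq, n_samples):
    """Pure sinusoid of the given frequency."""
    return [math.sin(2 * math.pi * freq * i / SAMPLE_RATE) for i in range(n_samples)]

# Test signal: 500 Hz for 4 frames, then 1000 Hz for 4 frames.
signal = tone(500, 4 * FRAME) + tone(1000, 4 * FRAME)
spec = spectrogram(signal)

bin_hz = SAMPLE_RATE / FRAME  # frequency step between two bins (125 Hz here)
peaks_hz = [row.index(max(row)) * bin_hz for row in spec]
print(peaks_hz)  # strongest frequency in each frame
```

Plotting `spec` as an image, with the magnitude mapped to a color, would give the kind of picture shown above.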
Fundamental sinusoid, sound and musical timbre
If we feed a loudspeaker with a sinusoidal signal of 440 Hz, it plays an A, but a very ugly one, because the spectrum of the generated sound consists mostly of a pure sinusoid at 440 Hz. The reason is that the diaphragm material has been selected to be as linearly elastic as possible, so that the displacement is proportional to the force exerted by the electromagnet. This linearity produces the single-sinusoid spectrum.
However, if we saturate the loudspeaker by giving the input signal an excessive level, the diaphragm starts to deform in a non-linear way, and harmonic distortion appears. Harmonic distortion denotes the addition of vibrations that are not part of the input signal: the sound is now made of 440 Hz and something else. By tuning the non-linear curve, a more harmonious mix can be obtained.
Musical instruments produce sounds with the help of, e.g., strings or pipes, using non-linear materials, which can't play pure sinusoidal sounds. Hitting the A string of a piano produces a sound which includes a sinusoid at 440 Hz, but also an infinite number of other sinusoids, in a specific proportion determined by the timbre of the instrument. Instruments are designed to produce a harmonious mix of sinusoids. This mix is usually dominated by the fundamental frequency, but this is not necessarily the case, and the human brain is anyway able to "hear" this fundamental frequency, even if it is not present in the mix.
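The effect of saturation can be sketched numerically. In the example below, hard clipping is only a crude stand-in for a diaphragm pushed too far, and the window length, number of cycles and clipping level are arbitrary assumptions. It measures the amplitude of the first three harmonics before and after clipping.

```python
import cmath
import math

N = 256       # samples in the analysis window
CYCLES = 8    # the input sinusoid completes exactly 8 cycles in the window

def harmonic_amplitude(signal, harmonic):
    """Amplitude of the given harmonic of the input sinusoid (one naive DFT bin)."""
    k = CYCLES * harmonic
    n = len(signal)
    s = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(signal))
    return 2 * abs(s) / n

def clip(x, limit):
    """Hard saturation: values beyond +/- limit are flattened."""
    return max(-limit, min(limit, x))

pure = [math.sin(2 * math.pi * CYCLES * i / N) for i in range(N)]
saturated = [clip(x, 0.5) for x in pure]

for name, sig in (("linear", pure), ("saturated", saturated)):
    amps = [harmonic_amplitude(sig, h) for h in (1, 2, 3)]
    print(name, [round(a, 3) for a in amps])
```

The linear signal has energy only at the fundamental, while the clipped one gains a third harmonic. Symmetric clipping adds odd harmonics only; a different non-linear curve would give a different mix.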
This behavior is the same for the voice. We can't produce a pure sinusoidal sound by whistling or any other way (see how the voice spectrum is rich in the spectrogram above).
Sounds not produced by a periodic wave, e.g. thunder, have no fundamental/harmonics decomposition. These sounds behave more like a random distribution of frequencies (i.e. noise).
The pitch mentioned in the following sections always refers to the fundamental frequency, regardless of the actual spectrum of the sound (the whole mix determined by the timbre).
Scales: Diatonic and chromatic scales
Western music is based on the octave interval, i.e. a range between frequencies f and 2f. The next octave contains the second harmonics of the previous one, and the two are considered to contain the same notes. To fix ideas, the first octave starts at 16.35 Hz with note C.
Each octave is divided into seven intervals, using 8 notes: C, D, E, F, G, A, B, C. These pitches form the C major diatonic scale, the scale we all learned at school:
An interval between two notes is measured as the ratio of the frequencies of the notes. Five intervals have the same extent, a tone, and the two others, E-F and B-C, are only half of this value, a semitone.
This division is found in all octaves, as doubling or halving the frequencies doesn't change the ratios. On a piano keyboard, these notes are the white keys. Any interval on this scale is between notes of different names, e.g. C to D.
There is another scale, which divides the octave into 12 nearly equal intervals, using 13 notes. This scale is the chromatic scale; chroma simply refers to these notes:
The notes composing the chromatic scale are the notes of the diatonic scale plus sharped notes splitting all full tone intervals into two semitones. On a keyboard, the sharped notes are the black keys.
Adding a semitone with a sharp is approximately equivalent to multiplying the frequency by 1.06 (12th root of 2).
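As a quick check of that factor, here is a small sketch. It assumes the equal-tempered semitone and the usual A = 440 Hz reference; the variable names are just for illustration.

```python
SEMITONE = 2 ** (1 / 12)     # 12th root of 2

print(round(SEMITONE, 4))    # 1.0595

a = 440.0                    # reference A, in Hz
a_sharp = a * SEMITONE       # one semitone higher
print(round(a_sharp, 2))     # 466.16

# Twelve semitones in a row give exactly one octave (a factor of 2),
# up to floating-point rounding.
print(SEMITONE ** 12)
```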
Sharp and flat
In addition to the sharp notes, a semitone higher than the unaltered notes (e.g. C♯ vs. C), there are also flat notes, a semitone lower than the unaltered notes (e.g. D♭ vs. D).
The semitone between an unaltered note X and one of its alterations (X♯ and X♭) is called a chromatic semitone.
Since C and D are spaced by a tone, C♯ is a chromatic semitone higher than C, and D♭ is a chromatic semitone lower than D, we may assume that C♯ is the same pitch as D♭ and that the tone C to D is equal to two chromatic semitones. But this is not correct: D♭ actually has a lower pitch than C♯.
Comma and inequality of sharp and flat
We now need to introduce a second kind of semitone, the one existing between two notes of different names, e.g. between C and D♭ or between E and F. It is called a diatonic semitone.
The size of a chromatic semitone is 5/9 of a tone, and the size of a diatonic semitone is 4/9 of a tone. This unit, 1/9 of a tone, is called a comma. Thus the chromatic semitone is 5 commas and the diatonic semitone is 4 commas. The tone, the sum of one chromatic semitone and one diatonic semitone, is 9 commas. E.g. for the tone C to D:
C to C♯ (chromatic, 5 commas) + C♯ to D (diatonic, 4 commas) = 9 commas
or equivalently
C to D♭ (diatonic, 4 commas) + D♭ to D (chromatic, 5 commas) = 9 commas
Which can be represented by:
C <-- 4 commas --> D♭ <-- 1 comma --> C♯ <-- 4 commas --> D.
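The 5/9 and 4/9 figures can be checked numerically. The sketch below assumes the Pythagorean frequency ratios for these intervals (9/8 for the tone, 256/243 for the diatonic semitone, 2187/2048 for the chromatic semitone) and measures them in cents, a logarithmic unit where an octave is 1200. It suggests the comma counts are a convenient approximation and not exact values.

```python
import math
from fractions import Fraction

def cents(ratio):
    """Size of an interval in cents (1200 cents per octave)."""
    return 1200 * math.log2(float(ratio))

# Assumed Pythagorean ratios
TONE = Fraction(9, 8)
DIATONIC = Fraction(256, 243)       # e.g. C to D♭, or E to F
CHROMATIC = Fraction(2187, 2048)    # e.g. C to C♯

# One diatonic plus one chromatic semitone make exactly one tone.
assert DIATONIC * CHROMATIC == TONE

comma = cents(TONE) / 9             # the "1/9 of a tone" unit used in the text
for name, ratio in (("diatonic", DIATONIC), ("chromatic", CHROMATIC)):
    size = cents(ratio)
    print(f"{name}: {size:.1f} cents = {size / comma:.2f} commas")
```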
All notes of the chromatic scale (including sharps and flats) can be obtained by starting with a known frequency, e.g. A = 440 Hz, multiplying each time by 3/2 (possibly dividing by 2 to relocate the result into the same octave). The result is a note which is located 3.5 tones above the starting note, which also means this interval spans over 5 notes, and for this reason is called a fifth. This is the way the scale was created by Pythagoras.
This property implies that the fifth is the third harmonic brought down by one octave, e.g. G (396.5 Hz for octave 3) is the third harmonic of C (264.3 Hz) divided by 2, D (297.3 Hz, after dividing by 2 once more to stay in the same octave) is obtained the same way from G, etc. Thus when playing a C, the sound produced by the instrument also contains a G in its harmonics. Playing C and G simultaneously does indeed produce a harmonious sound.
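The construction by successive fifths can be sketched in a few lines. The function names are hypothetical, the C = 264.3 Hz starting point is the value used in the text, and the assumption is that sharps are reached by going up in fifths and flats by going down.

```python
from fractions import Fraction

C = 264.3  # Hz, the C used in the text

def fold_into_octave(ratio):
    """Multiply or divide by 2 until 1 <= ratio < 2."""
    while ratio >= 2:
        ratio /= 2
    while ratio < 1:
        ratio *= 2
    return ratio

def fifths(n):
    """Frequency ratio to C after n fifths (negative n: fifths downwards)."""
    return fold_into_octave(Fraction(3, 2) ** n)

names_up = ["C", "G", "D", "A", "E", "B", "F♯", "C♯"]
for n, name in enumerate(names_up):
    print(f"{name:3} {float(fifths(n)) * C:7.1f} Hz")

c_sharp = fifths(7)    # seven fifths up from C
d_flat = fifths(-5)    # five fifths down from C
print("C♯", round(float(c_sharp) * C, 1), "Hz")
print("D♭", round(float(d_flat) * C, 1), "Hz")
```

With these assumptions D♭ comes out slightly lower than C♯, which is the inequality discussed above.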
Now we can introduce the final type of scale, the one you are interested in.
Equal-tempered chromatic scale
As seen, D♭ and C♯ are not the same pitch, and they are actually played differently on a violin where pitches are determined by the performer.
But, to simplify keyboards, these two notes are made equal on a piano, and are played with the same key (the black key between C and D). This corresponds to making the chromatic and diatonic semitones both equal to 4.5 commas. The scale made of 12 equal semitone intervals is called the equal-tempered chromatic scale.
Note: While equal temperament is defined by strictly equal semitones of 4.5 commas, instruments meant to be tuned in equal temperament are actually tuned with slightly unequal semitones. In practice, the black key between two white keys corresponds to unequal intervals on the left and right, to compensate for material behavior (e.g. piano strings for bass and treble are different and have a different non-linearity). Usually this can be ignored, but if you work with precise measurements of musical sounds, you'll start to see spectrograms with unexpected frequencies that are not part of a geometric progression.
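For reference, here is a minimal sketch of how equal-tempered frequencies are usually computed. It assumes A = 440 Hz in octave 4, with octave numbers chosen so that C in octave 0 is the 16.35 Hz mentioned earlier; `note_frequency` is a name made up for this example.

```python
NOTE_NAMES = ["C", "C♯", "D", "D♯", "E", "F", "F♯", "G", "G♯", "A", "A♯", "B"]

def note_frequency(name, octave, a4=440.0):
    """Equal-tempered frequency: each semitone multiplies by 2 ** (1 / 12)."""
    semitones_from_a4 = NOTE_NAMES.index(name) - 9 + 12 * (octave - 4)
    return a4 * 2 ** (semitones_from_a4 / 12)

for name, octave in (("C", 0), ("C", 4), ("C♯", 4), ("A", 4)):
    print(f"{name}{octave}: {note_frequency(name, octave):.2f} Hz")
```

In this scale C♯ and D♭ are the same entry of the table, which is exactly the simplification made on a keyboard.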
The chromatic scale is the finite pool of pitches
Western music, except in rare experiments, is not composed using the chromatic scale. Instead, a diatonic scale with more full tones than semitones is built from the pool of chromatic notes, by selecting a starting note (called the tonic) and a scheme for the intervals to be used, called the mode. Thus when using the B-minor scale, the tonic is B and the mode is minor.
Today there are two interval schemes used: Major and minor. With 12 possible starting notes, there are 24 possible diatonic scales.
The major intervals are the ones found when playing white keys and starting at C. The minor intervals are the ones found when playing white keys and starting at A.
Thus C-major and A-minor diatonic scales are played using the same white keys, but starting at different places of the chromatic scale. All other diatonic scales include one or more black keys.
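Building a diatonic scale from the chromatic pool can be sketched as below. The interval patterns are counted in semitones. As a simplification, every black key is spelled as a sharp, which a musician would not always do (some scales are written with flats); the names `MODES` and `diatonic_scale` are illustrative only.

```python
NOTE_NAMES = ["C", "C♯", "D", "D♯", "E", "F", "F♯", "G", "G♯", "A", "A♯", "B"]

# Interval schemes in semitones (2 = tone, 1 = semitone)
MODES = {
    "major": [2, 2, 1, 2, 2, 2, 1],   # white keys starting at C
    "minor": [2, 1, 2, 2, 1, 2, 2],   # white keys starting at A
}

def diatonic_scale(tonic, mode):
    """The seven notes of the scale, starting at the tonic."""
    index = NOTE_NAMES.index(tonic)
    scale = [tonic]
    for step in MODES[mode][:-1]:      # the last step only returns to the tonic
        index = (index + step) % 12
        scale.append(NOTE_NAMES[index])
    return scale

print(diatonic_scale("C", "major"))
print(diatonic_scale("A", "minor"))
print(diatonic_scale("B", "minor"))
print(len(NOTE_NAMES) * len(MODES), "possible scales")
```

C major and A minor come out as the same seven notes in a different order, while B minor needs two black keys.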
We may wonder what the purpose of A-minor is if it uses exactly the same pitches as C-major. The reason is that the first pitch of C-major is C while the first pitch of A-minor is A, and Western music is composed by giving the first note (degree I, called the tonic) a specific role, the resolution of dissonant chords. The same holds for the other degrees of the scale: they all have a specific role, and while it is possible to transpose a piece from A-minor to C-major, this transposition breaks the intent of the composer and the chord cadences used.
Chroma: Big word for a trivial concept
As seen above, chroma, chroma analysis and chroma feature may sound like big business, but there is nothing to worry about: chroma is just the fashionable word for a note or pitch of the equal-tempered chromatic scale, the ordinary set of notes used in Western music.
Spectrogram
The spectrogram is a 3D representation: axis x is time, axis y is frequency and axis z is generally amplitude or power (power is proportional to the square of amplitude). The z value is indicated by the color of the pixel at grid point (x,y).
Any axis can be made logarithmic. For the z axis this is usually done with decibels. For a power scale it corresponds to the transformation dB = 10 log10(P/P0), where P0 is a reference value, 1 unless otherwise specified. Doubling the power adds about 3 dB. As power ratios are the square of amplitude ratios, the decibel value for amplitude is dB (amplitude) = 20 log10(A/A0).
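The two decibel formulas can be written as small helper functions. This is only a sketch with made-up function names, using a reference value of 1 by default.

```python
import math

def power_db(p, p0=1.0):
    """Decibel value of a power ratio."""
    return 10 * math.log10(p / p0)

def amplitude_db(a, a0=1.0):
    """Decibel value of an amplitude ratio."""
    return 20 * math.log10(a / a0)

print(round(power_db(2), 2))        # doubling the power: about +3 dB
print(round(amplitude_db(2), 2))    # doubling the amplitude: about +6 dB

# An amplitude ratio and the corresponding power ratio (its square)
# give the same decibel value.
a = 0.25
print(amplitude_db(a), power_db(a ** 2))
```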
The graph below shows the power (z as gray scale) expressed in dB for the frequency y (Hz) at time x (x scale is not shown).
The same with gray shades replaced by colors:
The next graph is identical, except the y scale is logarithmic instead of linear, which makes more sense if energy is concentrated at the beginning of the scale (low frequencies), like here under 1 kHz:
This next graph is the same. From the title it seems power is shown instead of amplitude, but visually there is no color difference:
The next graph is similar, except the "constant Q" title likely means power values are computed using a constant-Q transform (CQT):
The CQT (instead of the usual discrete Fourier transforms) might be an attempt to extract more accurately the notes from the signal.
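The reason a constant-Q analysis suits notes better can be sketched by comparing bin frequencies. In a CQT the center frequencies are spaced geometrically, like the notes of the equal-tempered scale, so the ratio of center frequency to bin spacing (Q) stays the same in every octave. The starting frequency and the number of bins below are assumptions, and this only illustrates the bin layout, not a full transform.

```python
def cqt_center_frequencies(f_min, n_bins, bins_per_octave=12):
    """Geometrically spaced center frequencies, one per bin."""
    return [f_min * 2 ** (k / bins_per_octave) for k in range(n_bins)]

def quality_factor(bins_per_octave=12):
    """Q = center frequency / distance to the next bin, identical for all bins."""
    return 1 / (2 ** (1 / bins_per_octave) - 1)

freqs = cqt_center_frequencies(f_min=32.70, n_bins=25)   # two octaves from a low C
print([round(f, 1) for f in freqs[:4]], "...")
print("Q =", round(quality_factor(), 2))

# With 12 bins per octave, each bin lands on one equal-tempered semitone,
# and 12 bins higher the frequency has doubled.
print(round(freqs[12] / freqs[0], 6))
```

A discrete Fourier transform, by contrast, has bins a fixed number of hertz apart, so low notes share a bin while high notes are spread over many.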
The same data are shown in the graph below, but y is labeled with note names instead of frequencies:
Chromagram
The chromagram is a specific spectrogram where the y axis and the z values are particular.
Scale y includes only the 12 notes of the chromatic scale.
The z value is the sum of all sounds which correspond to each note, regardless of the octave, so C is the sum of C0 (C in octave 0), plus C1 (twice the frequency of C0), plus C2 (twice the frequency of C1), etc. These notes are all harmonics of C0.
Summing the signals from all octaves does discard the actual frequency information, but it makes sense for sounds produced by resonance in non-linear devices. They contain frequency f and harmonics at 2f, 3f, 4f, etc., according to the musical timbre. On the other hand, a chromagram is not suited to studying the actual spectrum of the sound.
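The folding of all octaves onto 12 notes can be sketched as follows. The input is assumed to be a list of (frequency, energy) pairs taken from one column of a spectrogram, and A = 440 Hz is assumed as the tuning reference. The function names are made up for this example and are not those of any library.

```python
import math

NOTE_NAMES = ["C", "C♯", "D", "D♯", "E", "F", "F♯", "G", "G♯", "A", "A♯", "B"]

def pitch_class(freq, a4=440.0):
    """Index (0 = C ... 11 = B) of the nearest equal-tempered note, octave ignored."""
    semitones_from_a = round(12 * math.log2(freq / a4))
    return (semitones_from_a + 9) % 12

def chroma_vector(peaks):
    """Sum the energy of (frequency_hz, energy) pairs into 12 pitch classes."""
    chroma = [0.0] * 12
    for freq, energy in peaks:
        chroma[pitch_class(freq)] += energy
    return chroma

# Three C's in different octaves and one A
peaks = [(130.81, 1.0), (261.63, 0.5), (523.25, 0.25), (440.0, 0.3)]
chroma = chroma_vector(peaks)
for name, value in zip(NOTE_NAMES, chroma):
    if value:
        print(name, value)
```

Note that the third harmonic of a C lands on G with this mapping (`pitch_class(3 * 261.63)` gives the index of G), so harmonics other than octaves do leak into other pitch classes.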
A chromagram:
What the z axis represents is not mentioned; possibly it is the amplitude (or power) relative to the maximum found in the signal (around note E).
The last graph is different in that the y axis doesn't show signal pitches but the tempo (beats per minute) of the sample.
Tempogram
The scale is logarithmic. The color indicates how frequently a given number of BPM is detected. More than one BPM value is detected because some notes are shorter than a beat, so notes repeat at rates higher than the actual BPM. Usually the algorithm used to perform the analysis also provides the most probable BPM, taking the onset distribution into account (e.g. librosa).
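Why several BPM values show up can be illustrated with a toy example. The sketch below is not librosa's algorithm: it simply autocorrelates a made-up onset envelope (strong onsets on each beat at 120 BPM, weaker onsets halfway between beats) and converts each lag into a BPM value. The frame rate and the envelope are assumptions.

```python
FRAME_RATE = 100  # onset-envelope frames per second (assumed)

def autocorrelation(env, lag):
    """Similarity of the envelope with itself shifted by `lag` frames."""
    return sum(env[i] * env[i + lag] for i in range(len(env) - lag))

# Toy onset envelope: a beat every 50 frames (120 BPM),
# plus a weaker onset halfway between beats (notes shorter than a beat).
envelope = [0.0] * 400
for i in range(0, 400, 50):
    envelope[i] = 1.0
for i in range(25, 400, 50):
    envelope[i] = 0.5

lags = range(20, 121)  # from 300 BPM down to 50 BPM
strength = {60 * FRAME_RATE / lag: autocorrelation(envelope, lag) for lag in lags}

detected = sorted(bpm for bpm, s in strength.items() if s > 0)
best_bpm = max(strength, key=strength.get)
print("detected:", detected)
print("most probable:", best_bpm)
```

A tempogram repeats this kind of analysis over successive time windows and shows the strengths as colors, one column per window.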