
How to calculate 95% confidence level of Fourier transform in Python?


After calculating the Fast Fourier Transform (FFT) of a time series in Python/SciPy, I am trying to plot the 95% confidence level above which the power spectrum differs from red or white noise, but I haven't found a straightforward way to do so. I tried following this thread: Power spectrum in python - significance levels, and wrote the following code to test a sine function with random noise:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2
from scipy.fft import rfft, rfftfreq

x=np.linspace(0,10,500)

data = np.sin(20*np.pi*x)+np.random.rand(500) - 0.5

yf = rfft(data)
xf = rfftfreq(len(data), 1) 
n=len(data)

var=np.var(data)

### degrees of freedom
M=n/2
phi=(2*(n-1)-M/2.)/M       
###values of chi-squared
chi_val_99 = chi2.isf(q=0.01/2, df=phi) #/2 for two-sided test
chi_val_95 = chi2.isf(q=0.05/2, df=phi)



### normalization of power spectrum with 1/n
plt.figure(figsize=(5,5))
plt.plot(xf,np.abs(yf)/n, color='k')  
plt.axhline(y=(var/n)*(chi_val_95/phi),color='r',linestyle='--') 

But the resulting line lies below the entire power spectrum, as in Fig. 1. What am I doing wrong? Is there another way to get the significance of the FFT power spectrum?

   Power spectrum (black) and significance line (red)


Solution

  • Background considerations

    I did not read all of the references included in the answer you linked to (in particular Pankofsky et al.), but I couldn't find an explicit derivation of the formula, nor exactly under which conditions the results apply. On the other hand, I found a few other references where a derivation could be confirmed more readily.

    Based on the answer to this question on dsp.stackexchange.com, if you only had white Gaussian noise with unit variance, the squared magnitude of each Fourier coefficient would have a chi-squared distribution with asymptotically 2 degrees of freedom (the sum of 2 Gaussians, one for each of the real and imaginary parts of the complex Fourier coefficient, when n >> 1). When the noise does not have unit variance, it follows a more general Gamma distribution (although in this case you can simply think of it as scaling the survival function). For noise with a uniform distribution in the [-0.5, 0.5] range and a sufficiently large number of samples, the distribution can also be approximated by a Gamma distribution thanks to the Central Limit Theorem.
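    The unit-variance case is easy to check numerically. The following is a quick Monte Carlo sanity check (not part of the original derivation): for white Gaussian noise with unit variance, the squared magnitudes of the interior rfft coefficients, normalized by n/2, should match a chi-squared distribution with 2 degrees of freedom (mean 2, variance 4):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1024
samples = []
for _ in range(500):
    noise = rng.standard_normal(n)        # unit-variance white noise
    coefs = np.fft.rfft(noise)[1:-1]      # interior bins: complex, 2 dof
    # Re and Im parts of each interior coefficient are ~ N(0, n/2),
    # so this normalized squared magnitude is ~ chi2(2)
    samples.append(np.abs(coefs)**2 / (n / 2))
samples = np.concatenate(samples)

print(samples.mean(), samples.var())  # close to 2 and 4, as chi2(2) predicts
```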

    To illustrate and better understand these distributions, we can go through gradually more complex cases.

    Frequency domain distribution of random noise

    For the sake of comparison with the later case of uniformly distributed data, we will use Gaussian noise with a matching variance. Since the variance of data uniformly distributed in the range [-0.5, 0.5] is 1/12, this gives us the following data:

    data = np.sqrt(1.0/12)*np.random.randn(500)
    

    Now let us check the statistics of the power spectrum. As indicated earlier, the squared magnitude of each frequency coefficient is a random variable with an approximately Gamma distribution. The shape parameter is half the degrees of freedom of the chi-squared distribution that would apply in the unit-variance case (so 1 here), and the scale parameter corresponds to the square of the time-domain scaling (by linearity the variate yf scales as data, so np.abs(yf)**2 scales as the square of data). We can validate this by plotting a histogram of the spectrum values against the probability density function:

    from scipy.stats import gamma

    yf = rfft(data)
    spectrum = np.abs(yf)**2/len(data)
    plt.figure(figsize=(5,5))
    plt.hist(spectrum, bins=100, density=True, label='data')
    z = np.linspace(0, np.max(spectrum), 100)
    plt.plot(z, gamma.pdf(z, 1, scale=1.0/12), 'k', label=r'$\Gamma(1,{:.3f})$'.format(1.0/12))
    plt.legend()
    

    As you can see the values are in pretty good agreement:

    Histogram of the spectrum values against the Γ(1, 1/12) density

    Going back to the spectrum plot:

    # degrees of freedom
    phi = 2
    ###values of chi-squared
    chi_val_95 = chi2.isf(q=0.05/2, df=phi) #/2 for two-sided test
    
    ### normalization of power spectrum with 1/n
    plt.figure(figsize=(5,5))
    plt.plot(xf,np.abs(yf)**2/n, color='k')
    # the following two lines should overlap
    plt.axhline(y=var*(chi_val_95/phi),color='r',linestyle='--')
    plt.axhline(y=gamma.isf(q=0.05/2, a=1, scale=var),color='b')
    

    Power spectrum (black) with the overlapping 95% thresholds (red dashed, blue)

    Just changing the data to use a uniform distribution in the [-0.5,0.5] range (with data = np.random.rand(500) - 0.5) gives an almost identical plot, with the confidence level remaining unchanged.
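    As a sanity check (assuming nothing beyond the setup above), we can pool normalized spectrum values from many realizations of uniform noise and verify that roughly 2.5% of them exceed the Gamma(1, 1/12) threshold drawn by the red and blue lines:

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(0)
n = 500
vals = []
for _ in range(2000):
    noise = rng.random(n) - 0.5            # uniform noise, variance 1/12
    yf = np.fft.rfft(noise)
    vals.append(np.abs(yf[1:-1])**2 / n)   # normalized interior bins
vals = np.concatenate(vals)

threshold = gamma.isf(q=0.05/2, a=1, scale=1.0/12)
print(np.mean(vals > threshold))           # close to 0.025
```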

    Frequency domain distribution of signal with noise

    To get a single threshold value corresponding to a 95% confidence interval where the noise part would fall if you could separate it from data containing both a sinusoidal component and noise (otherwise stated: the 95% confidence interval under the null hypothesis that the data is white noise), you need the variance of the noise. While trying to estimate this variance you may quickly realize that the sinusoidal component contributes a non-negligible portion of the overall data's variance. To remove this contribution we can take advantage of the fact that sinusoidal signals are more readily separated in the frequency domain: simply discard the x% largest values of the spectrum, under the assumption that those are mostly contributed by the spikes of the sinusoidal component. Note that the 95th-percentile choice for the outliers below is somewhat arbitrary:

    # remove outliers
    power = np.abs(yf)**2
    threshold = np.percentile(power, 95)
    filtered = power[power <= threshold]
    

    Then we can get the time-domain variance using Parseval's theorem:

    # estimate variance
    # in the time domain: variance ~ np.sum(data**2)/len(data)
    # in the frequency domain, Parseval's theorem (applied to the
    # one-sided rfft spectrum) gives variance ~ np.mean(np.abs(yf)**2)/len(data)
    var = np.mean(filtered)/len(data)
    
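    To see how well this works, here is a self-contained check (with an assumed random seed) that the outlier-filtered estimate lands near the true noise variance 1/12 ≈ 0.083; it comes out slightly low because the percentile filter also trims the largest noise bins:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 500)
data = np.sin(20*np.pi*x) + rng.random(500) - 0.5

yf = np.fft.rfft(data)
power = np.abs(yf)**2
threshold = np.percentile(power, 95)    # discard the top 5% of bins
filtered = power[power <= threshold]

var_est = np.mean(filtered) / len(data)
print(var_est)   # in the vicinity of 1/12 ~ 0.083, biased slightly low
```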

    Note that due to the dynamic range of values across the spectrum, you may prefer to visualize the results on a logarithmic scale:

    plt.figure(figsize=(5,5))
    plt.plot(xf,10*np.log10(np.abs(yf)**2/n), color='k')  
    plt.axhline(y=10*np.log10(gamma.isf(q=0.05/2, a=1, scale=var)),color='r',linestyle='--')
    

    Power spectrum in dB (black) with the 95% threshold (red dashed)

    If on the other hand you are trying to obtain a frequency-dependent 95% confidence interval, then you need to consider the contribution of the sinusoidal component at each frequency. For the sake of simplicity we will assume here that the amplitude of the sinusoidal component and the variance of the noise are known (otherwise we'd first need to estimate these). In this case the distribution gets shifted by the sinusoidal component's contribution:

    signal = np.sin(20*np.pi*x)
    data = signal + np.random.rand(500) - 0.5
    Sf   = rfft(signal)  # Assuming perfect knowledge of the sinusoidal component
    yf   = rfft(data)
    
    noiseVar = 1.0/12    # Assuming perfect knowledge of the noise variance
    threshold95 = np.abs(Sf)**2/n + gamma.isf(q=0.05/2, a=1, scale=noiseVar)
    plt.figure(figsize=(5,5))
    plt.plot(xf, 10*np.log10(np.abs(yf)**2/n), color='k')  
    plt.plot(xf, 10*np.log10(threshold95), color='r',linestyle='--')
    

    Power spectrum in dB (black) with the frequency-dependent 95% threshold (red dashed)

    Finally, while I kept the final plots in squared-amplitude units, nothing prevents you from taking the square root and viewing the corresponding thresholds in amplitude units.


    Edit: I've used a gamma(1,s) distribution, which is an asymptotically good distribution for data with a sufficient number of samples n. For really small data sizes the distribution more closely matches a gamma(0.5*(n/(n//2+1)),s) (due to the DC and Nyquist coefficients being purely real, thus having 1 degree of freedom, unlike all the other coefficients).
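    For reference, here is what that small-sample shape parameter evaluates to (the formula is taken directly from the edit above); it approaches 1 as n grows:

```python
def gamma_shape(n):
    # shape parameter 0.5*(n/(n//2+1)) from the small-sample correction
    return 0.5 * (n / (n // 2 + 1))

for n in (8, 32, 500):
    print(n, gamma_shape(n))   # 0.8, then values tending toward 1
```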