I am currently developing my own audio library and have implemented the Short-Time Fourier Transform (STFT) and its inverse as part of the signal processing pipeline. However, I have noticed that the STFT and its inverse operations seem to be causing a significant loss of data, resulting in very poor audio quality.
First I divide the audio signal into overlapping frames, and for each frame, I apply a window function to reduce spectral leakage. Then, I perform the Fourier Transform on each frame to obtain the frequency-domain representation.
Then as a test, I perform the inverse to get the audio back. I have plotted these results below and you can see the degradation.
Here are the functions:
class AudioLib:
'''Library of audio processing functions.'''
def __init__(self, blocksize=1024 * 2):
self.blocksize = blocksize
self.window = np.hanning(blocksize)
def stft(self, audio):
'''Compute the short-time Fourier transform of the audio.'''
# Split the audio into overlapping blocks
num_blocks = len(audio) // self.blocksize
blocks = np.reshape(audio[:num_blocks * self.blocksize], (num_blocks, self.blocksize))
# Apply the windowing function to each block
windowed_blocks = blocks * self.window[np.newaxis, :]
# Compute the Fourier transform of each block
spectrum = np.fft.fft(windowed_blocks, axis=1)
return spectrum
def istft(self, spectrum):
'''Compute the inverse short-time Fourier transform of the spectrum.'''
# Compute the inverse Fourier transform of each block
windowed_blocks = np.fft.ifft(spectrum, axis=1).real
# Apply overlap-and-add to reconstruct the output signal
output = np.zeros(len(spectrum) * self.blocksize)
for i, block in enumerate(windowed_blocks):
output[i * self.blocksize : (i + 1) * self.blocksize] += block
return output
I have tried changing the block size and while reducing the size improves the audio somewhat, it is still not perfect and I feel as though my implementation is incorrect. Any help regarding this would be greatly appreciated!
Like Christoph Rackwitz said, the problem with this STFT implementation is that the blocks are non-overlapping. For invertibility, you want that each block has 50% overlap with the next block.
Here is a possible simple implementation for extracting and overlap-adding the blocks:
# Copyright 2023 Google LLC.
# SPDX-License-Identifier: Apache-2.0
def extract_blocks(audio: np.ndarray, blocksize: int) -> np.ndarray:
"""Extracts blocks with 50% overlap."""
hop_step = blocksize // 2
blocks = []
offset = 0
while offset + blocksize <= len(audio):
blocks.append(audio[offset:(offset + blocksize)])
offset += hop_step
return np.column_stack(blocks)
def overlap_add_blocks(blocks: np.ndarray) -> np.ndarray:
"""Overlap-adds blocks with 50% overlap."""
blocksize, num_blocks = blocks.shape
hop_step = blocksize // 2
output = np.zeros(blocksize + (num_blocks - 1) * hop_step)
offset = 0
for i in range(num_blocks):
output[offset:(offset + blocksize)] += blocks[:, i]
offset += hop_step
return output
If you want do to this more efficiently, check out numpy stride_tricks.
Another detail: for accurate invertibility, the window should be such that adding 50%-overlapped translated copies produces 1.0. To do this, np.hanning
needs a small correction. Change self.window = np.hanning(blocksize)
to
self.window = np.hanning(blocksize + 1)[:blocksize]
This plot shows the difference for blocksize = 32
. The thick line is the sum of the windows. Notice that thick line is wiggly for np.hanning(blocksize)
but perfectly flat and equal to 1.0 for np.hanning(blocksize + 1)[:blocksize]
.