python numpy audio signal-processing fft

Loss of Data in Short-Time Fourier Transform and Inverse: Need Help Improving Audio Quality

I am currently developing my own audio library and have implemented the Short-Time Fourier Transform (STFT) and its inverse as part of the signal processing pipeline. However, I have noticed that the STFT and its inverse operations seem to be causing a significant loss of data, resulting in very poor audio quality.

First I divide the audio signal into overlapping frames, and for each frame, I apply a window function to reduce spectral leakage. Then, I perform the Fourier Transform on each frame to obtain the frequency-domain representation.

Then as a test, I perform the inverse to get the audio back. I have plotted these results below and you can see the degradation.

Here are the functions:

class AudioLib:
    '''Library of audio processing functions.'''
    def __init__(self, blocksize=1024 * 2):
        self.blocksize = blocksize
        self.window = np.hanning(blocksize)

    def stft(self, audio):
        '''Compute the short-time Fourier transform of the audio.'''
        # Split the audio into overlapping blocks
        num_blocks = len(audio) // self.blocksize
        blocks = np.reshape(audio[:num_blocks * self.blocksize], (num_blocks, self.blocksize))

        # Apply the windowing function to each block
        windowed_blocks = blocks * self.window[np.newaxis, :]

        # Compute the Fourier transform of each block
        spectrum = np.fft.fft(windowed_blocks, axis=1)

        return spectrum
    
    def istft(self, spectrum):
        '''Compute the inverse short-time Fourier transform of the spectrum.'''
        # Compute the inverse Fourier transform of each block
        windowed_blocks = np.fft.ifft(spectrum, axis=1).real

        # Apply overlap-and-add to reconstruct the output signal
        output = np.zeros(len(spectrum) * self.blocksize)
        for i, block in enumerate(windowed_blocks):
            output[i * self.blocksize : (i + 1) * self.blocksize] += block

        return output

I have tried changing the block size and while reducing the size improves the audio somewhat, it is still not perfect and I feel as though my implementation is incorrect. Any help regarding this would be greatly appreciated!

Solution

Like Christoph Rackwitz said, the problem with this STFT implementation is that the blocks are non-overlapping. For invertibility, you want that each block has 50% overlap with the next block.

Here is a possible simple implementation for extracting and overlap-adding the blocks:

# Copyright 2023 Google LLC.
# SPDX-License-Identifier: Apache-2.0

def extract_blocks(audio: np.ndarray, blocksize: int) -> np.ndarray:
  """Extracts blocks with 50% overlap."""
  hop_step = blocksize // 2
  blocks = []
  offset = 0

  while offset + blocksize <= len(audio):
    blocks.append(audio[offset:(offset + blocksize)])
    offset += hop_step

  return np.column_stack(blocks)


def overlap_add_blocks(blocks: np.ndarray) -> np.ndarray:
  """Overlap-adds blocks with 50% overlap."""
  blocksize, num_blocks = blocks.shape
  hop_step = blocksize // 2
  output = np.zeros(blocksize + (num_blocks - 1) * hop_step)
  offset = 0

  for i in range(num_blocks):
    output[offset:(offset + blocksize)] += blocks[:, i]
    offset += hop_step
  
  return output

If you want do to this more efficiently, check out numpy stride_tricks.

Another detail: for accurate invertibility, the window should be such that adding 50%-overlapped translated copies produces 1.0. To do this, np.hanning needs a small correction. Change self.window = np.hanning(blocksize) to

self.window = np.hanning(blocksize + 1)[:blocksize]

This plot shows the difference for blocksize = 32. The thick line is the sum of the windows. Notice that thick line is wiggly for np.hanning(blocksize) but perfectly flat and equal to 1.0 for np.hanning(blocksize + 1)[:blocksize].