python-3.x audio pytorch compression torchaudio

torchaudio.save() writes a .wav file twice as large as the original .wav file

I'm really new to PyTorch and torchaudio. I found that the file torchaudio saves is twice as large as the original file, even though I just load a .wav file and immediately save the audio to another .wav file. Why does it get bigger?

I've checked that the bit depth(?) and sample rate are the same. What makes the resaved file twice as large as the original? Furthermore, it's not exactly twice as large; it's slightly smaller than that. I can't find other people with the same problem. I guess the original is a compressed file? I ask because saving resave_audio again produces a file of the same size as the resaved one.

Can anyone give me a hint or a keyword for what is going on here?

import os

import torch
import torchaudio

# Load the original file, save it unchanged, then load the saved copy.
ori_audio, ori_sr = torchaudio.load('LJ037-0171.wav')

torchaudio.save('LJ037-0171_resave.wav', ori_audio, ori_sr)

resave_audio, resave_sr = torchaudio.load('LJ037-0171_resave.wav')

# Compare sample rate, tensor contents, dtype and shape.
print(f'Original sr: {ori_sr}, Resaved sr: {resave_sr}')
print(f'Audio tensor equal: {torch.equal(ori_audio, resave_audio)}')
print(f'datatype of ori_audio: {ori_audio.dtype}')
print(f'datatype of resave_audio: {resave_audio.dtype}')
print(f'Shape of ori: {ori_audio.shape}')
print(f'Shape of resave: {resave_audio.shape}')

# Compare the file sizes on disk, in bytes.
print(f'File size of original wav: {os.path.getsize("LJ037-0171.wav")}')
print(f'File size of resaved wav: {os.path.getsize("LJ037-0171_resave.wav")}')

Output:

Original sr: 22050, Resaved sr: 22050
Audio tensor equal: True
datatype of ori_audio: torch.float32
datatype of resave_audio: torch.float32
Shape of ori: torch.Size([1, 167226])
Shape of resave: torch.Size([1, 167226])
File size of original wav: 334496
File size of resaved wav: 668962

Solution

I was able to reproduce this; thanks for the clear code. The difference comes from torchaudio, which falls back to a default encoding when none is specified. I assume your original sample comes from LJSpeech, whose samples are encoded as 16-bit signed integer PCM: each audio sample is an integer from -32768 to +32767, so each sample takes 16 bits.

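By default, torchaudio.load returns a 32-bit floating-point tensor with values from -1.0 to +1.0, whatever the encoding of the file. When you don't specify an encoding, torchaudio.save picks one from the tensor's dtype, so a float32 tensor is written as 32-bit floating-point PCM and each sample takes 32 bits instead of 16. That is why the file is almost twice the size. It is not exactly twice the size because only the audio data doubles, not the header: 167226 samples × 2 bytes = 334452 bytes of data in the original and 167226 × 4 bytes = 668904 bytes in the resaved file, and the remaining 44 and 58 bytes of the reported sizes are header data. You can check the encoding using soxi LJ037-0171_resave.wav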
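If soxi is not installed, the same information can be read straight from the WAV header. The following is a minimal sketch using only the Python standard library; the function name wav_format and the FORMAT_NAMES table are made up for this illustration and are not part of torchaudio. It relies on the standard WAV format tags, where 1 means integer PCM and 3 means IEEE float.

```python
import struct

# Format tags defined by the WAV format: 1 = integer PCM, 3 = IEEE float.
FORMAT_NAMES = {1: "integer PCM", 3: "IEEE float"}


def wav_format(path):
    """Return (format name, channels, sample rate, bits per sample) of a WAV file."""
    with open(path, "rb") as f:
        riff, _, wave_id = struct.unpack("<4sI4s", f.read(12))
        if riff != b"RIFF" or wave_id != b"WAVE":
            raise ValueError(f"{path} is not a RIFF/WAVE file")
        # Walk the chunks until the 'fmt ' chunk is found.
        while True:
            header = f.read(8)
            if len(header) < 8:
                raise ValueError("no 'fmt ' chunk found")
            chunk_id, chunk_size = struct.unpack("<4sI", header)
            if chunk_id == b"fmt ":
                fmt = f.read(chunk_size)
                tag, channels, rate, _, _, bits = struct.unpack("<HHIIHH", fmt[:16])
                return FORMAT_NAMES.get(tag, f"tag {tag}"), channels, rate, bits
            # Skip this chunk; chunks are padded to an even number of bytes.
            f.seek(chunk_size + (chunk_size & 1), 1)


# Example call (file name taken from the question):
# print(wav_format("LJ037-0171_resave.wav"))
```

Running it on both files from the question should show whether they differ in format tag and bits per sample, which is the difference described above.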

If you want to control the encoding and bits per sample, pass encoding and bits_per_sample to torchaudio.save, as described in the Torchaudio backend doc (torchaudio.load does not take these arguments). If you use bits_per_sample=16 and encoding="PCM_S" (for signed integer PCM), the resaved file should have the same encoding and size as the original.