
How to detect delay or silence in an audio file?


I want to detect silence or a delay of a given duration in an audio file and remove it. For example, someone starts speaking and then pauses for a while to think.

There's this question, but it only detects the silence at the end and doesn't remove it. My colleague suggested sox, but I'm not sure it's the best tool for the job, nor, frankly, how to use it. Moreover, the project died in 2015.


Solution

  • The Sox man page describes this in detail.

       silence [-l] above-periods [duration threshold[d|%]
              [below-periods duration threshold[d|%]]]
    

    If we start with a sample command:

    sox input.mp3 out.mp3 -S silence -l 1 0.2 1% -1 0.2 1%
    
    `-S`      - show progress
    `silence` - the filter
    `-l`      - leave x amount of each silence intact
    `1`       - trim silence from the start, up to the 1st non-silence [above-periods]
    `0.2`     - amount of non-silence that must be detected before trimming stops [duration]
    `1%`      - test for near absolute (0%) silence [threshold]
    `-1`      - trim silence from the middle of the file [below-periods]
    `0.2`     - amount of each silence to leave untouched [duration]
    `1%`      - test for near absolute (0%) silence [threshold]
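
    The argument breakdown above can also be expressed as a small helper that assembles the same command. This is only a sketch: the function name `build_silence_command` and its parameter names are made up for illustration, and the values are simply the ones from the sample command.

```python
def build_silence_command(infile, outfile, start_duration="0.2",
                          keep="0.2", threshold="1%", show_progress=True):
    """Return an argv list equivalent to the sample sox command.

    All names here are illustrative; only the sox arguments themselves
    come from the sample command.
    """
    cmd = ["sox", infile, outfile]
    if show_progress:
        cmd.append("-S")  # show progress
    cmd += [
        "silence", "-l",
        "1", start_duration, threshold,  # above-periods, duration, threshold
        "-1", keep, threshold,           # below-periods, duration, threshold
    ]
    return cmd
```

    Calling `build_silence_command("input.mp3", "out.mp3")` yields the argument list for `sox input.mp3 out.mp3 -S silence -l 1 0.2 1% -1 0.2 1%`, which can be passed to `subprocess` without going through a shell.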
    

    The details, courtesy of the SoX man page:

       silence [-l] above-periods [duration threshold[d|%]
              [below-periods duration threshold[d|%]]]
    
              Removes silence from the beginning, middle, or end of the audio.
              `Silence' is determined by a specified threshold.
    
              The above-periods value is used to indicate if audio should be trimmed at
              the beginning of the audio. A value of zero indicates no silence should
              be trimmed from the beginning. When specifying a non-zero above-periods,
              it trims audio up until it finds non-silence. Normally, when trimming
              silence from beginning of audio the above-periods will be 1 but it can be
              increased to higher values to trim all audio up to a specific count of
              non-silence periods. For example, if you had an audio file with two songs
              that each contained 2 seconds of silence before the song, you could
              specify an above-period of 2 to strip out both silence periods and the
              first song.
    
              When above-periods is non-zero, you must also specify a duration and
              threshold. duration indicates the amount of time that non-silence must be
              detected before it stops trimming audio. By increasing the duration,
              burst of noise can be treated as silence and trimmed off.
    
              threshold is used to indicate what sample value you should treat as
              silence. For digital audio, a value of 0 may be fine but for audio
              recorded from analog, you may wish to increase the value to account for
              background noise.
    
              When optionally trimming silence from the end of the audio, you specify a
              below-periods count. In this case, below-period means to remove all
              audio after silence is detected. Normally, this will be a value 1 of but
              it can be increased to skip over periods of silence that are wanted. For
              example, if you have a song with 2 seconds of silence in the middle and 2
              second at the end, you could set below-period to a value of 2 to skip
              over the silence in the middle of the audio.
    
              For below-periods, duration specifies a period of silence that must exist
              before audio is not copied any more. By specifying a higher duration,
              silence that is wanted can be left in the audio. For example, if you
              have a song with an expected 1 second of silence in the middle and 2
              seconds of silence at the end, a duration of 2 seconds could be used to
              skip over the middle silence.
    
              Unfortunately, you must know the length of the silence at the end of your
              audio file to trim off silence reliably. A workaround is to use the
              silence effect in combination with the reverse effect. By first reversing
              the audio, you can use the above-periods to reliably trim all audio from
              what looks like the front of the file. Then reverse the file again to
              get back to normal.
    
              To remove silence from the middle of a file, specify a below-periods that
              is negative. This value is then treated as a positive value and is also
              used to indicate that the effect should restart processing as specified
              by the above-periods, making it suitable for removing periods of silence
              in the middle of the audio.
    
              The option -l indicates that below-periods duration length of audio
              should be left intact at the beginning of each period of silence. For
              example, if you want to remove long pauses between words but do not want
              to remove the pauses completely.
    
              duration is a time specification with the peculiarity that a bare number
              is interpreted as a sample count, not as a number of seconds. For
              specifying seconds, either use the t suffix (as in `2t') or specify
              minutes, too (as in `0:02').
    
              threshold numbers may be suffixed with d to indicate the value is in
              decibels, or % to indicate a percentage of maximum value of the sample
              value (0% specifies pure digital silence).
    

    Finally, a Python sample that enables monitoring of the output:

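    from subprocess import PIPE, STDOUT, Popen  # at module level

    try:
        # Merge stderr into stdout so sox's -S progress output arrives on one pipe.
        self.comm = Popen(['sox', self.orig_file, self.new_file, '-S', 'silence',
                           '-l', '1', '0.1', '1%', '-1', str(self.secs.GetValue()), '1%'],
                          stdout=PIPE, stderr=STDOUT, universal_newlines=True)
    except Exception as e:
        ......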
    

    It's worth pointing out that ffmpeg has the filters silencedetect and silenceremove.
    While I do use silencedetect, e.g.:

    ffmpeg -hide_banner -stats -i interview.wav -af silencedetect=noise=0dB:d=3 -vn -sn -dn -f null -
    

    silenceremove, e.g.:

    ffmpeg -hide_banner -v quiet -i interview.wav -af silenceremove=stop_periods=-1:stop_duration=2:stop_threshold=-3dB -vn -sn -dn -f wav - | ffplay -hide_banner -v quiet -autoexit -i -
    

    I've found to be less dependable.
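
    Since silencedetect only reports what it finds, its log has to be parsed before the pauses can be acted on. The sketch below assumes the log lines contain `silence_start: <seconds>` and `silence_end: <seconds> | silence_duration: <seconds>`, which is the format I believe ffmpeg writes to stderr; verify it against your build. The function name `parse_silencedetect` is made up for illustration.

```python
import re

# Assumed log format (check your ffmpeg build):
#   [silencedetect @ 0x...] silence_start: 1.5
#   [silencedetect @ 0x...] silence_end: 4.75 | silence_duration: 3.25
_START = re.compile(r"silence_start:\s*(\S+)")
_END = re.compile(r"silence_end:\s*(\S+)")


def parse_silencedetect(log_text):
    """Return [(start, end), ...] in seconds; end is None if the log stops mid-silence."""
    intervals = []
    start = None
    for line in log_text.splitlines():
        match = _START.search(line)
        if match:
            start = float(match.group(1))
            continue
        match = _END.search(line)
        if match and start is not None:
            intervals.append((start, float(match.group(1))))
            start = None
    if start is not None:
        intervals.append((start, None))  # silence ran to the end of the log
    return intervals
```

    The log text could come from capturing stderr of the silencedetect command above, for instance with `subprocess.run(..., stderr=subprocess.PIPE, universal_newlines=True)`.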

    It should also be pointed out that silence is notoriously difficult to pin down, except on an individual, ad hoc basis, because of background noise.
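
    To make the threshold problem concrete, here is a deliberately naive detector in pure Python. It is not how SoX or ffmpeg implement their filters; it just flags runs of samples whose absolute amplitude stays under a threshold for a minimum duration. The names `percent_to_db` and `find_silences`, and the assumption that samples are floats in [-1, 1], are mine.

```python
import math


def percent_to_db(percent):
    """Convert a threshold in percent of full scale to decibels (1% is about -40 dB).

    Raises ValueError for 0, since pure digital silence has no finite dB value.
    """
    return 20 * math.log10(percent / 100)


def find_silences(samples, rate, threshold=0.01, min_duration=0.2):
    """Return (start, end) times in seconds of quiet runs.

    samples: floats in [-1, 1]; rate: samples per second;
    threshold: amplitude below which a sample counts as silent.
    """
    silences = []
    run_start = None
    for i, sample in enumerate(samples):
        if abs(sample) < threshold:
            if run_start is None:
                run_start = i  # a quiet run begins here
        else:
            if run_start is not None and (i - run_start) / rate >= min_duration:
                silences.append((run_start / rate, i / rate))
            run_start = None
    # Handle a quiet run that lasts until the end of the audio.
    if run_start is not None and (len(samples) - run_start) / rate >= min_duration:
        silences.append((run_start / rate, len(samples) / rate))
    return silences
```

    With a 1% threshold, a pause filled with hiss at 0.5% of full scale counts as silence, but at 0.1% the same pause is treated as sound. That is why the threshold usually has to be tuned per recording.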