FFMPEG Video to Audio Conversion Results in Different Durations


I am trying to convert an MP4 file into a mono WAV file sampled at 16,000 Hz.

When I run the command below, the duration changes from 00:09:59.99 (MP4) to 00:09:57.64 (WAV). The original, longer version of the file goes from 00:48:37.46 (MP4) to 00:48:23.38 (WAV).

ffmpeg -i <FILE_NAME>.mp4 -ac 1 -ar 16000 <FILE_NAME>.wav

I've also tried the command below. The result is much worse, going from 00:09:59.99 (MP4) to 00:12:56.29 (AAC).

ffmpeg -i <FILE_NAME>.mp4 -vn -acodec copy <FILE_NAME>.aac

Attaching the log:

Report written to "ffmpeg-20200610-093115.log"
Command line:
ffmpeg -i short.mp4 -ac 1 -ar 16000 short.wav -report
ffmpeg version 4.1.1 Copyright (c) 2000-2019 the FFmpeg developers
  built with Apple LLVM version 10.0.0 (clang-1000.11.45.5)
  configuration: --prefix=/usr/local/Cellar/ffmpeg/4.1.1 --enable-shared --enable-pthreads --enable-version3 --enable-hardcoded-tables --enable-avresample --cc=clang --host-cflags='-I/Library/Java/JavaVirtualMachines/openjdk-11.0.2.jdk/Contents/Home/include -I/Library/Java/JavaVirtualMachines/openjdk-11.0.2.jdk/Contents/Home/include/darwin' --host-ldflags= --enable-ffplay --enable-gnutls --enable-gpl --enable-libaom --enable-libbluray --enable-libmp3lame --enable-libopus --enable-librubberband --enable-libsnappy --enable-libtesseract --enable-libtheora --enable-libvorbis --enable-libvpx --enable-libx264 --enable-libx265 --enable-libxvid --enable-lzma --enable-libfontconfig --enable-libfreetype --enable-frei0r --enable-libass --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-librtmp --enable-libspeex --enable-videotoolbox --disable-libjack --disable-indev=jack --enable-libaom --enable-libsoxr
  libavutil      56. 22.100 / 56. 22.100
  libavcodec     58. 35.100 / 58. 35.100
  libavformat    58. 20.100 / 58. 20.100
  libavdevice    58.  5.100 / 58.  5.100
  libavfilter     7. 40.101 /  7. 40.101
  libavresample   4.  0.  0 /  4.  0.  0
  libswscale      5.  3.100 /  5.  3.100
  libswresample   3.  3.100 /  3.  3.100
  libpostproc    55.  3.100 / 55.  3.100
Splitting the commandline.
Reading option '-i' ... matched as input url with argument 'short.mp4'.
Reading option '-ac' ... matched as option 'ac' (set number of audio channels) with argument '1'.
Reading option '-ar' ... matched as option 'ar' (set audio sampling rate (in Hz)) with argument '16000'.
Reading option 'short.wav' ... matched as output url.
Reading option '-report' ... matched as option 'report' (generate a report) with argument '1'.
Finished splitting the commandline.
Parsing a group of options: global .
Applying option report (generate a report) with argument 1.
Successfully parsed a group of options.
Parsing a group of options: input url short.mp4.
Successfully parsed a group of options.
Opening an input file: short.mp4.
[NULL @ 0x7f98a3008200] Opening 'short.mp4' for reading
[file @ 0x7f98a2904440] Setting default whitelist 'file,crypto'
[mov,mp4,m4a,3gp,3g2,mj2 @ 0x7f98a3008200] Format mov,mp4,m4a,3gp,3g2,mj2 probed with size=2048 and score=100
[mov,mp4,m4a,3gp,3g2,mj2 @ 0x7f98a3008200] ISO: File Type Major Brand: mp42
[mov,mp4,m4a,3gp,3g2,mj2 @ 0x7f98a3008200] Unknown dref type 0x206c7275 size 12
[mov,mp4,m4a,3gp,3g2,mj2 @ 0x7f98a3008200] Processing st: 0, edit list 0 - media time: 0, duration: 7679872
[mov,mp4,m4a,3gp,3g2,mj2 @ 0x7f98a3008200] Unknown dref type 0x206c7275 size 12
[mov,mp4,m4a,3gp,3g2,mj2 @ 0x7f98a3008200] Processing st: 1, edit list 0 - media time: 1024, duration: 26459559
[mov,mp4,m4a,3gp,3g2,mj2 @ 0x7f98a3008200] drop a frame at curr_cts: 0 @ 0
[mov,mp4,m4a,3gp,3g2,mj2 @ 0x7f98a3008200] Before avformat_find_stream_info() pos: 11213917 bytes read:318782 seeks:1 nb_streams:2
[h264 @ 0x7f98a3808800] nal_unit_type: 7(SPS), nal_ref_idc: 3
[h264 @ 0x7f98a3808800] nal_unit_type: 8(PPS), nal_ref_idc: 3
[mov,mp4,m4a,3gp,3g2,mj2 @ 0x7f98a3008200] demuxer injecting skip 1024 / discard 0
[aac @ 0x7f98a1008c00] skip 1024 / discard 0 samples due to side data
[h264 @ 0x7f98a3808800] nal_unit_type: 6(SEI), nal_ref_idc: 0
[h264 @ 0x7f98a3808800] nal_unit_type: 5(IDR), nal_ref_idc: 3
[h264 @ 0x7f98a3808800] Format yuv420p chosen by get_format().
[h264 @ 0x7f98a3808800] Reinit context to 640x368, pix_fmt: yuv420p
[mov,mp4,m4a,3gp,3g2,mj2 @ 0x7f98a3008200] All info found
[mov,mp4,m4a,3gp,3g2,mj2 @ 0x7f98a3008200] After avformat_find_stream_info() pos: 21961 bytes read:351550 seeks:2 frames:46
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'short.mp4':
  Metadata:
    major_brand     : mp42
    minor_version   : 1
    compatible_brands: isommp41mp42
    creation_time   : 2020-06-10T16:12:17.000000Z
  Duration: 00:09:59.99, start: 0.000000, bitrate: 149 kb/s
    Stream #0:0(eng), 1, 1/12800: Video: h264 (Constrained Baseline) (avc1 / 0x31637661), yuv420p, 640x360 [SAR 1:1 DAR 16:9], 47 kb/s, 25 fps, 25 tbr, 12800 tbn, 50 tbc (default)
    Metadata:
      creation_time   : 2020-06-10T16:12:17.000000Z
      handler_name    : Core Media Video
    Stream #0:1(eng), 45, 1/44100: Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, mono, fltp, 98 kb/s (default)
    Metadata:
      creation_time   : 2020-06-10T16:12:17.000000Z
      handler_name    : Core Media Audio
Successfully opened the file.
Parsing a group of options: output url short.wav.
Applying option ac (set number of audio channels) with argument 1.
Applying option ar (set audio sampling rate (in Hz)) with argument 16000.
Successfully parsed a group of options.
Opening an output file: short.wav.
[file @ 0x7f98a0c1db40] Setting default whitelist 'file,crypto'
Successfully opened the file.
Stream mapping:
  Stream #0:1 -> #0:0 (aac (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
cur_dts is invalid (this is harmless if it occurs once at the start per stream)
[aac @ 0x7f98a100de00] skip 1024 / discard 0 samples due to side data
cur_dts is invalid (this is harmless if it occurs once at the start per stream)
detected 12 logical cores
[graph_0_in_0_1 @ 0x7f98a0e2c4c0] Setting 'time_base' to value '1/44100'
[graph_0_in_0_1 @ 0x7f98a0e2c4c0] Setting 'sample_rate' to value '44100'
[graph_0_in_0_1 @ 0x7f98a0e2c4c0] Setting 'sample_fmt' to value 'fltp'
[graph_0_in_0_1 @ 0x7f98a0e2c4c0] Setting 'channel_layout' to value '0x4'
[graph_0_in_0_1 @ 0x7f98a0e2c4c0] tb:1/44100 samplefmt:fltp samplerate:44100 chlayout:0x4
[format_out_0_0 @ 0x7f98a0e2cb80] Setting 'sample_fmts' to value 's16'
[format_out_0_0 @ 0x7f98a0e2cb80] Setting 'sample_rates' to value '16000'
[format_out_0_0 @ 0x7f98a0e2cb80] Setting 'channel_layouts' to value '0x4'
[format_out_0_0 @ 0x7f98a0e2cb80] auto-inserting filter 'auto_resampler_0' between the filter 'Parsed_anull_0' and the filter 'format_out_0_0'
[AVFilterGraph @ 0x7f98a0c16ac0] query_formats: 4 queried, 6 merged, 3 already done, 0 delayed
[auto_resampler_0 @ 0x7f98a0e2d540] [SWR @ 0x7f98a28e1000] Using fltp internally between filters
[auto_resampler_0 @ 0x7f98a0e2d540] ch:1 chl:mono fmt:fltp r:44100Hz -> ch:1 chl:mono fmt:s16 r:16000Hz
Output #0, wav, to 'short.wav':
  Metadata:
    major_brand     : mp42
    minor_version   : 1
    compatible_brands: isommp41mp42
    ISFT            : Lavf58.20.100
    Stream #0:0(eng), 0, 1/16000: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s (default)
    Metadata:
      creation_time   : 2020-06-10T16:12:17.000000Z
      handler_name    : Core Media Audio
      encoder         : Lavc58.35.100 pcm_s16le
size=   17152kB time=00:09:16.63 bitrate= 252.4kbits/s speed=1.11e+03x    
[out_0_0 @ 0x7f98a0e2c700] EOF on sink link out_0_0:default.
No more output streams to write to, finishing.
size=   18676kB time=00:09:59.99 bitrate= 255.0kbits/s speed=1.11e+03x    
video:0kB audio:18676kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.000408%
Input file #0 (short.mp4):
  Input stream #0:0 (video): 1 packets read (3689 bytes); 
  Input stream #0:1 (audio): 25739 packets read (7375414 bytes); 25738 frames decoded (26355712 samples); 
  Total: 25740 packets (7379103 bytes) demuxed
Output file #0 (short.wav):
  Output stream #0:0 (audio): 25739 frames encoded (9562163 samples); 25739 packets muxed (19124326 bytes); 
  Total: 25739 packets (19124326 bytes) muxed
25738 frames successfully decoded, 0 decoding errors
[AVIOContext @ 0x7f98a0c1dc40] Statistics: 4 seeks, 76 writeouts
[AVIOContext @ 0x7f98a29045c0] Statistics: 10902846 bytes read, 29 seeks

Solution

  • Containers such as MP4 and MKV store packets with timestamps. One byproduct is that silence in an audio track can be represented simply by leaving a gap between the timestamps of consecutive packets. WAV files and raw AAC bitstreams have no timestamps, so any silence encoded that way is lost.

    Your input audio is 44100 Hz. In this line near the end of the log,

    Input stream #0:1 (audio): 25739 packets read (7375414 bytes); 25738 frames decoded (26355712 samples); 
    

    you can see that 26355712 samples were decoded from the input stream. At 44100 Hz, that is about 597.64 seconds (00:09:57.64), which matches the duration of the WAV output.
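
The arithmetic above can be checked with a short Python sketch. The sample count, sample rate, and container duration are copied from the log; the helper names `duration_seconds` and `format_hms` are made up for this illustration and are not part of ffmpeg.

```python
from fractions import Fraction


def duration_seconds(sample_count: int, sample_rate_hz: int) -> Fraction:
    """Playback time of back-to-back samples, ignoring any timestamps."""
    return Fraction(sample_count, sample_rate_hz)


def format_hms(seconds: Fraction) -> str:
    """Format seconds as HH:MM:SS.cc, similar to ffmpeg's duration display."""
    total_cs = round(seconds * 100)           # whole centiseconds
    total_s, cs = divmod(total_cs, 100)
    minutes, s = divmod(total_s, 60)
    h, m = divmod(minutes, 60)
    return f"{h:02d}:{m:02d}:{s:02d}.{cs:02d}"


# Values taken from the log: "25738 frames decoded (26355712 samples)" at 44100 Hz.
decoded = duration_seconds(26355712, 44100)

# Container duration reported by ffmpeg for short.mp4: 00:09:59.99.
container = Fraction(59999, 100)

# Time that exists only as timestamp gaps and is lost in a plain WAV conversion.
missing = container - decoded

print(format_hms(decoded))                    # 00:09:57.64
print(f"missing: {float(missing):.2f} s")
```

The gap of a little over two seconds is the silence that is represented only by timestamps in the MP4.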

    To insert the missing silence and preserve the source duration, use

    ffmpeg -i <FILE_NAME>.mp4 -af aresample=async=1 -ac 1 -ar 16000 <FILE_NAME>.wav
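
To illustrate why the timestamp gaps matter, here is a toy model in Python. It is only a sketch of the idea, not ffmpeg's implementation: packets are modeled as hypothetical `(pts, samples)` pairs with `pts` counted in samples, and the two function names are invented for this example. Writing the samples back-to-back drops the gap, while padding each gap with zeros keeps the timeline intact, which is roughly what the filling behavior of `async=1` is for.

```python
from typing import List, Tuple

# Hypothetical packet model: (presentation timestamp in samples, sample values).
Packet = Tuple[int, List[float]]


def concat_ignoring_timestamps(packets: List[Packet]) -> List[float]:
    """Write samples back-to-back, as a timestamp-less format effectively does."""
    out: List[float] = []
    for _pts, samples in packets:
        out.extend(samples)
    return out


def concat_filling_gaps(packets: List[Packet]) -> List[float]:
    """Pad with zeros (silence) whenever a packet starts later than the output so far."""
    out: List[float] = []
    for pts, samples in packets:
        gap = pts - len(out)
        if gap > 0:
            out.extend([0.0] * gap)   # insert silence for the missing span
        out.extend(samples)
    return out


# Three 4-sample packets; the third starts at sample 12, leaving a 4-sample gap.
packets: List[Packet] = [
    (0, [0.1] * 4),
    (4, [0.2] * 4),
    (12, [0.3] * 4),
]

print(len(concat_ignoring_timestamps(packets)))  # 12 samples: the gap is lost
print(len(concat_filling_gaps(packets)))         # 16 samples: the gap is kept as silence
```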