c++ audio video ffmpeg libavformat

FFMPEG: multiplexing streams with different duration


I am multiplexing video and audio streams. The video stream comes from generated image data; the audio stream comes from an AAC file. Some audio files are longer than the total video time I set, so my strategy is to stop muxing the audio stream once its time exceeds the total video time (which I control through the number of encoded video frames).

I won't put the whole setup code here, but it is similar to the muxing.c example from the latest FFmpeg repo. The only difference is that, as I said, I take the audio stream from a file rather than from synthetically generated encoded frames. I am pretty sure the issue is wrong sync in my muxer loop. Here is what I do:

void AudioSetup(const char* audioInFileName)
{
    AVOutputFormat* outputF = mOutputFormatContext->oformat;
    auto audioCodecId = outputF->audio_codec;

    if (audioCodecId == AV_CODEC_ID_NONE) {
        return false;
    }

    audio_codec = avcodec_find_encoder(audioCodecId);

    avformat_open_input(&mInputAudioFormatContext,
    audioInFileName, 0, 0);
    avformat_find_stream_info(mInputAudioFormatContext, 0);

    av_dump_format(mInputAudioFormatContext, 0, audioInFileName, 0);


    for (size_t i = 0; i < mInputAudioFormatContext->nb_streams; i++) {
        if (mInputAudioFormatContext->streams[i]->codecpar->codec_type == AVMEDIA_TYPE_AUDIO) {
            inAudioStream = mInputAudioFormatContext->streams[i];

            AVCodecParameters *in_codecpar = inAudioStream->codecpar;
            mAudioOutStream.st = avformat_new_stream(mOutputFormatContext, NULL);
            mAudioOutStream.st->id = mOutputFormatContext->nb_streams - 1;
            AVCodecContext* c = avcodec_alloc_context3(audio_codec);
            mAudioOutStream.enc = c;
            c->sample_fmt = audio_codec->sample_fmts[0];
            avcodec_parameters_to_context(c, inAudioStream->codecpar);
            // copy params from input to output audio stream:
            avcodec_parameters_copy(mAudioOutStream.st->codecpar, inAudioStream->codecpar);

            mAudioOutStream.st->time_base.num = 1;
            mAudioOutStream.st->time_base.den = c->sample_rate;

            c->time_base = mAudioOutStream.st->time_base;

            if (mOutputFormatContext->oformat->flags & AVFMT_GLOBALHEADER) {
                c->flags |= AV_CODEC_FLAG_GLOBAL_HEADER;
            }
            break;
        }
    }
}

void Encode()
{
    int cc = av_compare_ts(mVideoOutStream.next_pts, mVideoOutStream.enc->time_base,
    mAudioOutStream.next_pts, mAudioOutStream.enc->time_base);

    if (mAudioOutStream.st == NULL || cc <= 0) {
        uint8_t* data = GetYUVFrame();//returns ready video YUV frame to work with
        int ret = 0;
        AVPacket pkt = { 0 };
        av_init_packet(&pkt);
        pkt.size = packet->dataSize;
        pkt.data = data;
        const int64_t duration = av_rescale_q(1, mVideoOutStream.enc->time_base, mVideoOutStream.st->time_base);

        pkt.duration = duration;
        pkt.pts = mVideoOutStream.next_pts;
        pkt.dts = mVideoOutStream.next_pts;
        mVideoOutStream.next_pts += duration;

        pkt.stream_index = mVideoOutStream.st->index;
        ret = av_interleaved_write_frame(mOutputFormatContext, &pkt);
    } else
    if(audio_time <  video_time) {
        //5 -  duration of video in seconds
        AVRational r = {  60, 1 };

        auto cmp= av_compare_ts(mAudioOutStream.next_pts, mAudioOutStream.enc->time_base, 5, r);
        if (cmp >= 0) {
            mAudioOutStream.next_pts = (int64_t)std::numeric_limits<int64_t>::max();
            return true; //don't mux audio anymore
        }

        AVPacket a_pkt = { 0 };
        av_init_packet(&a_pkt);

        int ret = 0;
        ret = av_read_frame(mInputAudioFormatContext, &a_pkt);
        // if the audio file is shorter than the video, stop muxing audio at the end of the file
        if (ret == AVERROR_EOF) {
            mAudioOutStream.next_pts = (int64_t)std::numeric_limits<int64_t>::max(); 
            return true;
        }
        a_pkt.stream_index = mAudioOutStream.st->index;

        av_packet_rescale_ts(&a_pkt, inAudioStream->time_base, mAudioOutStream.st->time_base);
        mAudioOutStream.next_pts += a_pkt.pts;

        ret = av_interleaved_write_frame(mOutputFormatContext, &a_pkt);
    }
}

Now, the video part is flawless. But if the audio track is longer than the video duration, the total video length comes out around 5% - 20% longer, and it is clear that the audio is contributing to that, as the video frames finish exactly where they're supposed to.

The closest 'hack' I came up with is this part:

AVRational r = {  60 ,1 };
auto cmp= av_compare_ts(mAudioOutStream.next_pts, mAudioOutStream.enc->time_base, 5, r);
if (cmp >= 0) {
    mAudioOutStream.next_pts = (int64_t)std::numeric_limits<int64_t>::max();
    return true;
} 

Here I was trying to compare next_pts of the audio stream with the total time set for the video file, which is 5 seconds. By setting r = {60,1} I am converting those seconds to the time_base of the audio stream. At least that's what I believe I am doing. With this hack, I get a very small deviation from the correct movie length when using standard AAC files, that is, a sample rate of 44100, stereo. But if I test with more problematic samples, like AAC with a sample rate of 16000, mono, then the video file adds almost a whole second to its length. I would appreciate it if someone could point out what I am doing wrong here.

Important note: I don't set the duration on any of the contexts. I control the termination of the muxing session, which is based on the video frame count. The audio input stream has a duration, of course, but it doesn't help me, as the video duration is what defines the movie length.

UPDATE:

This is the second bounty attempt.

UPDATE 2:

Actually, my audio time base of {den,num} was wrong, while {1,1} is indeed the way to go, as explained in the answer. What was preventing it from working was a bug in this line (my bad):

     mAudioOutStream.next_pts += a_pkt.pts;

Which must be:

     mAudioOutStream.next_pts = a_pkt.pts;

The bug made next_pts accumulate the sum of all packet pts values, so it grew far faster than the real timestamps (quadratically rather than linearly). It therefore reached the end-of-stream threshold very early, and the audio stream was terminated much earlier than it was supposed to be.
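
To see how much earlier the cutoff triggers, here is a minimal, FFmpeg-free sketch of the two update rules. The helper packetsUntilLimit and its parameters are hypothetical, and it assumes packets whose pts advance by a fixed step (for example 1024 samples per AAC packet at a 1/44100 time base):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not FFmpeg API: counts how many packets are read
// before the tracked next_pts reaches `limit`.
// Assumption: packets carry pts = 0, step, 2*step, ...
// accumulate == true  models the buggy  `next_pts += pkt_pts`
// accumulate == false models the fixed  `next_pts  = pkt_pts`
int64_t packetsUntilLimit(int64_t step, int64_t limit, bool accumulate)
{
    assert(step > 0 && limit > 0);
    int64_t next_pts = 0;
    int64_t count = 0;
    while (next_pts < limit) {                 // same role as the cmp >= 0 check
        const int64_t pkt_pts = count * step;  // pts of the packet just read
        if (accumulate)
            next_pts += pkt_pts;               // bug: sums up all pts values
        else
            next_pts = pkt_pts;                // fix: track the latest pts
        ++count;
    }
    return count;
}
```

Under these assumptions, packetsUntilLimit(1024, 5 * 44100, false) counts 217 packets before the 5-second limit is hit, while the += variant stops after only 22.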


Solution

  • The problem is that you tell it to compare the given audio time with 5 ticks at 60 seconds per tick. I am actually surprised that it works in some cases, but I guess it really depends on the specific time_base of the given audio stream.

    Let's assume the audio has a time_base of 1/25 and the stream is at 6 seconds, which is more than you want, so you want av_compare_ts to return 0 or 1. Given these conditions, you'll have the following values:

    mAudioOutStream.next_pts = 150
    mAudioOutStream.enc->time_base = 1/25
    

    Thus you call av_compare_ts with the following parameters:

    ts_a = 150
    tb_a = 1/25
    ts_b = 5
    tb_b = 60/1
    

    Now let's look at the implementation of av_compare_ts:

    int av_compare_ts(int64_t ts_a, AVRational tb_a, int64_t ts_b, AVRational tb_b)
    {
        int64_t a = tb_a.num * (int64_t)tb_b.den;
        int64_t b = tb_b.num * (int64_t)tb_a.den;
        if ((FFABS(ts_a)|a|FFABS(ts_b)|b) <= INT_MAX)
            return (ts_a*a > ts_b*b) - (ts_a*a < ts_b*b);
        if (av_rescale_rnd(ts_a, a, b, AV_ROUND_DOWN) < ts_b)
            return -1;
        if (av_rescale_rnd(ts_b, b, a, AV_ROUND_DOWN) < ts_a)
            return 1;
        return 0;
    }
    

    Given the above values, you get:

    a = 1 * 1 = 1
    b = 60 * 25 = 1500
    

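    For values this small, av_compare_ts actually takes the fast path in its first if and compares ts_a*a = 150 with ts_b*b = 7500 directly, returning -1. The av_rescale_rnd path, which runs for large values, gives the same result and is easier to follow, so let's trace it. av_rescale_rnd is called with these parameters: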

    a = 150
    b = 1
    c = 1500
    rnd = AV_ROUND_DOWN
    

    Given our parameters, we can actually strip down the entire function av_rescale_rnd to the following line. (I will not copy the whole function body for av_rescale_rnd as it is rather long, but you can find it in libavutil/mathematics.c.)

    return (a * b) / c;
    

    This will return (150 * 1) / 1500, which is 0.

    Thus av_rescale_rnd(ts_a, a, b, AV_ROUND_DOWN) < ts_b will resolve to true, because 0 is smaller than ts_b (5), and so av_compare_ts will return -1, which is exactly what you don't want.

    If you change your r to 1/1 it should work, because now your 5 will actually be treated as 5 seconds:

    ts_a = 150
    tb_a = 1/25
    ts_b = 5
    tb_b = 1/1
    

    In av_compare_ts we now get:

    a = 1 * 1 = 1
    b = 1 * 25 = 25
    

    Then av_rescale_rnd is called with these parameters:

    a = 150
    b = 1
    c = 25
    rnd = AV_ROUND_DOWN
    

    This will return (150 * 1) / 25, which is 6.

    6 is greater than 5, the condition fails, and av_rescale_rnd is called again, this time with:

    a = 5
    b = 25
    c = 1
    rnd = AV_ROUND_DOWN
    

    which will return (5 * 25) / 1, which is 125. That is smaller than 150, thus 1 is returned and voilà, your problem is solved.
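
The two cases above can be checked without FFmpeg using a simplified stand-in for the comparison. Rational and compareTs below are hypothetical names, not FFmpeg API; the sketch only reproduces the cross-multiplication and assumes values small enough not to overflow:

```cpp
#include <cassert>
#include <cstdint>

// Minimal rational type for illustration; a stand-in for AVRational.
struct Rational {
    int num;
    int den;
};

// Simplified stand-in for av_compare_ts (not the FFmpeg function itself):
// compares ts_a * tb_a with ts_b * tb_b by cross-multiplying.
// Assumes positive denominators and products that fit into int64_t.
// Returns -1, 0 or 1 when a is before, equal to, or after b.
int compareTs(int64_t ts_a, Rational tb_a, int64_t ts_b, Rational tb_b)
{
    assert(tb_a.den > 0 && tb_b.den > 0);
    const int64_t a = static_cast<int64_t>(tb_a.num) * tb_b.den;
    const int64_t b = static_cast<int64_t>(tb_b.num) * tb_a.den;
    const int64_t lhs = ts_a * a;
    const int64_t rhs = ts_b * b;
    return (lhs > rhs) - (lhs < rhs);
}
```

With these definitions, compareTs(150, Rational{1, 25}, 5, Rational{60, 1}) yields -1, because the 5 is treated as 300 seconds, while compareTs(150, Rational{1, 25}, 5, Rational{1, 1}) yields 1.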

    In case step_size is greater than 1

    If the step_size of your audio stream isn't 1, you need to modify your r to account for that, e.g. for step_size = 1024:

    r = { 1, 1024 };
    

    Let's quickly recap what happens now:

    At ~6 seconds:

    mAudioOutStream.next_pts = 282
    mAudioOutStream.enc->time_base = 1/48000
    

    av_compare_ts gets the following parameters:

    ts_a = 282
    tb_a = 1/48000
    ts_b = 5
    tb_b = 1/1024
    

    Thus:

    a = 1 * 1024 = 1024
    b = 1 * 48000 = 48000
    

    And in av_rescale_rnd:

    a = 282
    b = 1024
    c = 48000
    rnd = AV_ROUND_DOWN
    

    (a * b) / c will give (282 * 1024) / 48000 = 288768 / 48000 which is 6.

    With r={1,1} you would've gotten 0 again, because it would've calculated (282 * 1) / 48000.
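
Put differently, the comparison with r = {1, 1024} asks whether the frames written so far, times the samples per frame, cover the target number of seconds at the sample rate. A small sketch of that arithmetic, assuming next_pts counts frames of step_size samples each (reachedDuration is a hypothetical helper, not FFmpeg API):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true once `frames` audio frames of `frameSize`
// samples each, at `sampleRate` Hz, cover at least `seconds` seconds.
// This is the same cross-multiplication as comparing
// (frames, 1/sampleRate) against (seconds, 1/frameSize).
bool reachedDuration(int64_t frames, int frameSize, int sampleRate, int64_t seconds)
{
    assert(frameSize > 0 && sampleRate > 0);
    return frames * frameSize >= seconds * sampleRate;
}
```

For the numbers used above, reachedDuration(282, 1024, 48000, 5) is true, whereas passing a frame size of 1 (the r = {1,1} case) makes it false for the same 282 frames.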