I am multiplexing video and audio streams. The video stream comes from generated image data; the audio stream comes from an AAC file. Some audio files are longer than the total video time I set, so my strategy is to stop muxing the audio stream once its time exceeds the total video time (which I control by the number of encoded video frames).
I won't post the whole setup code here, but it is similar to the muxing.c example from the latest FFmpeg repo. The only difference is that I use an audio stream from a file, as I said, rather than synthetically generated encoded frames. I am pretty sure the issue is wrong sync in my muxer loop. Here is what I do:
void AudioSetup(const char* audioInFileName)
{
    AVOutputFormat* outputF = mOutputFormatContext->oformat;
    auto audioCodecId = outputF->audio_codec;
    if (audioCodecId == AV_CODEC_ID_NONE) {
        return false;
    }
    audio_codec = avcodec_find_encoder(audioCodecId);
    avformat_open_input(&mInputAudioFormatContext,
                        audioInFileName, 0, 0);
    avformat_find_stream_info(mInputAudioFormatContext, 0);
    av_dump_format(mInputAudioFormatContext, 0, audioInFileName, 0);
    for (size_t i = 0; i < mInputAudioFormatContext->nb_streams; i++) {
        if (mInputAudioFormatContext->streams[i]->codecpar->codec_type == AVMEDIA_TYPE_AUDIO) {
            inAudioStream = mInputAudioFormatContext->streams[i];
            AVCodecParameters *in_codecpar = inAudioStream->codecpar;
            mAudioOutStream.st = avformat_new_stream(mOutputFormatContext, NULL);
            mAudioOutStream.st->id = mOutputFormatContext->nb_streams - 1;
            AVCodecContext* c = avcodec_alloc_context3(audio_codec);
            mAudioOutStream.enc = c;
            c->sample_fmt = audio_codec->sample_fmts[0];
            avcodec_parameters_to_context(c, inAudioStream->codecpar);
            // copy params from the input to the output audio stream:
            avcodec_parameters_copy(mAudioOutStream.st->codecpar, inAudioStream->codecpar);
            mAudioOutStream.st->time_base.num = 1;
            mAudioOutStream.st->time_base.den = c->sample_rate;
            c->time_base = mAudioOutStream.st->time_base;
            if (mOutputFormatContext->oformat->flags & AVFMT_GLOBALHEADER) {
                c->flags |= CODEC_FLAG_GLOBAL_HEADER;
            }
            break;
        }
    }
}
void Encode()
{
    int cc = av_compare_ts(mVideoOutStream.next_pts, mVideoOutStream.enc->time_base,
                           mAudioOutStream.next_pts, mAudioOutStream.enc->time_base);
    if (mAudioOutStream.st == NULL || cc <= 0) {
        uint8_t* data = GetYUVFrame(); //returns ready video YUV frame to work with
        int ret = 0;
        AVPacket pkt = { 0 };
        av_init_packet(&pkt);
        pkt.size = packet->dataSize;
        pkt.data = data;
        const int64_t duration = av_rescale_q(1, mVideoOutStream.enc->time_base, mVideoOutStream.st->time_base);
        pkt.duration = duration;
        pkt.pts = mVideoOutStream.next_pts;
        pkt.dts = mVideoOutStream.next_pts;
        mVideoOutStream.next_pts += duration;
        pkt.stream_index = mVideoOutStream.st->index;
        ret = av_interleaved_write_frame(mOutputFormatContext, &pkt);
    } else if (audio_time < video_time) {
        //5 - duration of video in seconds
        AVRational r = { 60, 1 };
        auto cmp = av_compare_ts(mAudioOutStream.next_pts, mAudioOutStream.enc->time_base, 5, r);
        if (cmp >= 0) {
            mAudioOutStream.next_pts = (int64_t)std::numeric_limits<int64_t>::max();
            return true; //don't mux audio anymore
        }
        AVPacket a_pkt = { 0 };
        av_init_packet(&a_pkt);
        int ret = 0;
        ret = av_read_frame(mInputAudioFormatContext, &a_pkt);
        //if the audio file is shorter, stop muxing at the end of the file
        if (ret == AVERROR_EOF) {
            mAudioOutStream.next_pts = (int64_t)std::numeric_limits<int64_t>::max();
            return true;
        }
        a_pkt.stream_index = mAudioOutStream.st->index;
        av_packet_rescale_ts(&a_pkt, inAudioStream->time_base, mAudioOutStream.st->time_base);
        mAudioOutStream.next_pts += a_pkt.pts;
        ret = av_interleaved_write_frame(mOutputFormatContext, &a_pkt);
    }
}
Now, the video part is flawless. But if the audio track is longer than the video duration, the total video length comes out around 5% to 20% longer, and it is clear that audio is contributing to that, as the video frames finish exactly where they're supposed to.
The closest 'hack' I came up with is this part:
AVRational r = { 60, 1 };
auto cmp = av_compare_ts(mAudioOutStream.next_pts, mAudioOutStream.enc->time_base, 5, r);
if (cmp >= 0) {
    mAudioOutStream.next_pts = (int64_t)std::numeric_limits<int64_t>::max();
    return true;
}
Here I was trying to compare next_pts of the audio stream with the total time set for the video file, which is 5 seconds. By setting r = {60, 1} I am converting those seconds by the time_base of the audio stream. At least that's what I believe I am doing. With this hack, I get a very small deviation from the correct movie length when using standard AAC files, that is, a sample rate of 44100 Hz, stereo. But if I test with more problematic samples, such as AAC at a sample rate of 16000 Hz, mono, the video file gains almost a whole second in length.
I would appreciate it if someone could point out what I am doing wrong here.
Important note: I don't set a duration on any of the contexts. I control the termination of the muxing session, which is based on the video frame count. The audio input stream has a duration, of course, but it doesn't help me, as the video duration is what defines the movie length.
UPDATE:
This is the second bounty attempt.
UPDATE 2:
Actually, my audio time base of {den, num} was wrong, while {1, 1} is indeed the way to go, as explained by the answer. What was preventing it from working was a bug in this line (my bad):
mAudioOutStream.next_pts += a_pkt.pts;
Which must be:
mAudioOutStream.next_pts = a_pkt.pts;
The bug made next_pts grow far faster than the real timestamps, because each packet's pts was added on top of the running total instead of replacing it. As a result the audio stream reached the end-of-stream threshold (in terms of pts) very early and was terminated much earlier than it was supposed to be.
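To see how quickly the += version runs away, here is a minimal sketch that simulates the bookkeeping without FFmpeg. It assumes packets of 1024 samples whose pts is already expressed in a 1/44100 time base, so 5 seconds is 220500 ticks. The name packets_until_limit and its parameters are hypothetical, chosen only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: counts how many packets are consumed before next_pts
// reaches limit_pts. Packet i carries pts = i * samples_per_packet.
// accumulate == true reproduces `next_pts += pkt.pts` (the bug);
// accumulate == false reproduces `next_pts = pkt.pts` (the fix).
int packets_until_limit(int64_t samples_per_packet, int64_t limit_pts, bool accumulate)
{
    assert(samples_per_packet > 0);
    int64_t next_pts = 0;
    int packets = 0;
    while (next_pts < limit_pts) {
        const int64_t pkt_pts = packets * samples_per_packet;
        if (accumulate)
            next_pts += pkt_pts;   // running sum of all pts values seen so far
        else
            next_pts = pkt_pts;    // pts of the most recent packet only
        ++packets;
    }
    return packets;
}
```

Under these assumptions, packets_until_limit(1024, 220500, false) stops after 217 packets (about 5 seconds of audio), while packets_until_limit(1024, 220500, true) stops after only 22 packets (about half a second).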
The problem is that you tell it to compare the given audio time with 5 ticks at 60 seconds per tick. I am actually surprised that it works in some cases, but I guess it really depends on the specific time_base of the given audio stream.
Let's assume the audio has a time_base of 1/25 and the stream is at 6 seconds, which is more than you want, so you want av_compare_ts to return 0 or 1. Given these conditions, you'll have the following values:
mAudioOutStream.next_pts = 150
mAudioOutStream.enc->time_base = 1/25
Thus you call av_compare_ts with the following parameters:
ts_a = 150
tb_a = 1/25
ts_b = 5
tb_b = 60/1
Now let's look at the implementation of av_compare_ts:
int av_compare_ts(int64_t ts_a, AVRational tb_a, int64_t ts_b, AVRational tb_b)
{
    int64_t a = tb_a.num * (int64_t)tb_b.den;
    int64_t b = tb_b.num * (int64_t)tb_a.den;
    if ((FFABS(ts_a)|a|FFABS(ts_b)|b) <= INT_MAX)
        return (ts_a*a > ts_b*b) - (ts_a*a < ts_b*b);
    if (av_rescale_rnd(ts_a, a, b, AV_ROUND_DOWN) < ts_b)
        return -1;
    if (av_rescale_rnd(ts_b, b, a, AV_ROUND_DOWN) < ts_a)
        return 1;
    return 0;
}
Given the above values, you get:
a = 1 * 1 = 1
b = 60 * 25 = 1500
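These values are small enough that the INT_MAX check passes, so the first return statement already yields -1 (150 * 1 < 5 * 1500). The av_rescale_rnd path gives the same answer and shows more clearly what goes wrong, so let's follow it. It would be called with these parameters: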
a = 150
b = 1
c = 1500
rnd = AV_ROUND_DOWN
Given our parameters, we can actually strip down the entire function av_rescale_rnd to the following line. (I will not copy the whole function body for av_rescale_rnd as it is rather long, but you can find it in libavutil/mathematics.c in the FFmpeg source.)
return (a * b) / c;
This will return (150 * 1) / 1500, which is 0 (integer division).
Thus av_rescale_rnd(ts_a, a, b, AV_ROUND_DOWN) < ts_b will resolve to true, because 0 is smaller than ts_b (5), and so av_compare_ts will return -1, which is exactly the opposite of what you want.
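The comparison can be checked without linking against FFmpeg. The sketch below mirrors the first return statement of the function quoted above (plain cross-multiplication), which is only valid for small values like the ones in this example. Rational and compare_ts_small are stand-in names assumed for illustration; they are not FFmpeg API.

```cpp
#include <cassert>
#include <cstdint>

struct Rational { int num; int den; };  // stand-in for AVRational (assumption)

// Compares ts_a * tb_a with ts_b * tb_b by cross-multiplication.
// Returns -1, 0 or 1. Only valid while the products fit in int64_t.
int compare_ts_small(int64_t ts_a, Rational tb_a, int64_t ts_b, Rational tb_b)
{
    assert(tb_a.den > 0 && tb_b.den > 0);
    const int64_t a = tb_a.num * static_cast<int64_t>(tb_b.den);
    const int64_t b = tb_b.num * static_cast<int64_t>(tb_a.den);
    const int64_t lhs = ts_a * a;
    const int64_t rhs = ts_b * b;
    return (lhs > rhs) - (lhs < rhs);
}
```

With these stand-ins, compare_ts_small(150, {1, 25}, 5, {60, 1}) yields -1, and the result only reaches 0 at a pts of 7500, which is 300 seconds into a 1/25 stream, because 5 ticks of 60 seconds each amount to 5 minutes.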
If you change your r to 1/1 it should work, because now your 5 will actually be treated as 5 seconds:
ts_a = 150
tb_a = 1/25
ts_b = 5
tb_b = 1/1
In av_compare_ts we now get:
a = 1 * 1 = 1
b = 1 * 25 = 25
Then av_rescale_rnd is called with these parameters:
a = 150
b = 1
c = 25
rnd = AV_ROUND_DOWN
This will return (150 * 1) / 25, which is 6.
6 is greater than 5, the condition fails, and av_rescale_rnd is called again, this time with:
a = 5
b = 25
c = 1
rnd = AV_ROUND_DOWN
which will return (5 * 25) / 1, which is 125. That is smaller than 150, thus 1 is returned and voilà, your problem is solved.
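The same stop condition can also be written as a small predicate that takes the limit directly in seconds, which is what a {1, 1} time base expresses. This is a minimal sketch under the assumption that next_pts and the limit are small enough for the products to fit in 64 bits; TimeBase and audio_reached_limit are hypothetical names, not part of FFmpeg or of the code in the question.

```cpp
#include <cassert>
#include <cstdint>

struct TimeBase { int num; int den; };  // stand-in for AVRational (assumption)

// True once next_pts (counted in units of tb seconds) has reached
// limit_seconds. Uses the identity
//   next_pts * num / den >= limit  <=>  next_pts * num >= limit * den
// which holds for den > 0 and avoids any rounding.
bool audio_reached_limit(int64_t next_pts, TimeBase tb, int64_t limit_seconds)
{
    assert(tb.den > 0);
    return next_pts * tb.num >= limit_seconds * static_cast<int64_t>(tb.den);
}
```

For the 1/25 example, audio_reached_limit(150, {1, 25}, 5) is true and audio_reached_limit(124, {1, 25}, 5) is false, matching the points at which av_compare_ts with r = {1, 1} would return a non-negative value.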
In case step_size is greater than 1
If the step_size of your audio stream isn't 1, you need to modify your r to account for that, e.g. step_size = 1024:
r = { 1, 1024 };
Let's quickly recap what happens now:
At ~6 seconds:
mAudioOutStream.next_pts = 282
mAudioOutStream.enc->time_base = 1/48000
av_compare_ts gets the following parameters:
ts_a = 282
tb_a = 1/48000
ts_b = 5
tb_b = 1/1024
Thus:
a = 1 * 1024 = 1024
b = 1 * 48000 = 48000
And in av_rescale_rnd:
a = 282
b = 1024
c = 48000
rnd = AV_ROUND_DOWN
(a * b) / c will give (282 * 1024) / 48000 = 288768 / 48000, which is 6.
With r = {1, 1} you would've gotten 0 again, because it would've calculated (282 * 1) / 48000.
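The arithmetic for the step_size case can be reproduced in isolation. The sketch below assumes next_pts counts whole audio frames of samples_per_frame samples each, as in the example above; frames_to_whole_seconds is a hypothetical helper, not an FFmpeg function.

```cpp
#include <cassert>
#include <cstdint>

// Whole seconds covered by `frames` audio frames of `samples_per_frame`
// samples each at `sample_rate` Hz. Integer division rounds down, which
// matches the simplified (a * b) / c form used in the walkthrough.
int64_t frames_to_whole_seconds(int64_t frames, int64_t samples_per_frame, int64_t sample_rate)
{
    assert(sample_rate > 0);
    return frames * samples_per_frame / sample_rate;
}
```

With 282 frames at 48000 Hz this gives 6 for a step of 1024 and 0 for a step of 1, the two results discussed above. Under the same assumptions the 5-second mark is first reached at frame 235, since 235 * 1024 = 240640 samples.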