Efficiently concatenating & crossfading .ts files with FFMPEG

I would like to efficiently crossfade multiple transport stream (.ts) files together into a single m4a file. This means segmenting each file, crossfading only the 3 seconds at the end of one file with the 3 seconds at the beginning of the next, then concatenating the untouched segments and the newly crossfaded overlaps, so that as little audio as possible is decoded and re-encoded.

Gyan provides a solution that correctly crossfades a list of audio files. I was able to modify this program to produce the correct output m4a file for my purposes. However, this requires re-encoding the entirety of each audio file. To crossfade 10 audio files (each 3-5 minutes long), this solution runs for 8-12 seconds, which is too slow for serving this audio in my real-time/live-stream use case.

To avoid this decoding/re-encoding bottleneck, I've written a program to segment each transport file, crossfade the overlaps, then concatenate all of the relevant components. This program runs within 1-2 seconds for the 10 audio file case above, which does meet my real-time use case.

Below is an abbreviated version that crossfades and concatenates two 10-second transport files (a.ts, b.ts). These files are AAC-encoded, mono, and contain just sine waves at different frequencies.

ffmpeg -i a.ts -map 0 -f segment -segment_times 7 -c:a copy a_%d.ts
ffmpeg -i b.ts -map 0 -f segment -segment_times 3 -c:a copy b_%d.ts

ffmpeg -i a_1.ts -i b_0.ts -filter_complex acrossfade=d=3:c1=qua:c2=qua xfade.m4a
ffmpeg -i xfade.m4a -c:a copy xfade.ts

ffmpeg -i "concat:a_0.ts|xfade.ts|b_1.ts" -c:a copy out.m4a

Note that crossfading the two overlapping ~3 second files (a_1.ts and b_0.ts) requires writing to .m4a first, then converting back to .ts. Crossfading directly to .ts, or concatenating a mix of .ts and .m4a files, resulted in unplayable or missing audio in the out.m4a file.
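
For the 10-file case, the "concat:" URL used in the final step above has to be assembled from a longer list of pieces. A minimal sketch of one way to do that in POSIX shell is below; the helper name concat_url is hypothetical (it is not part of ffmpeg), and it assumes none of the file names contain a "|" character, since that is the separator the concat protocol uses.

```shell
#!/bin/sh
# Hypothetical helper: join the given file names into an ffmpeg
# concat-protocol URL of the form "concat:first|second|third".
# Assumes no file name contains "|".
concat_url() {
    url="concat:"
    sep=""
    for f in "$@"; do
        url="${url}${sep}${f}"
        sep="|"
    done
    printf '%s\n' "$url"
}
```

With the files from the example above, this could be used as: ffmpeg -i "$(concat_url a_0.ts xfade.ts b_1.ts)" -c:a copy out.m4a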

This program produces an audio file that is almost correct (17 seconds of audio, with a 3 second crossfade between the two files). Below is an image of the waveform. The top is the waveform produced by the regular crossfade, which re-encodes the entire files (Gyan's solution), for comparison. The bottom is produced by my program.

Note that there are artifacts introduced at the boundaries of the crossfade. These small gaps cause the audio to "dip" in volume at the beginning of the crossfade, and there is a click at the end of the crossfade. These artifacts are NOT present in the regular, inefficient crossfade.

My questions are:

What is causing these artifacts to be introduced?
Examining the xfade.m4a waveform reveals that there is a slight ramp up added to the beginning, but the ffmpeg documentation doesn't reference this. Is this a consequence of muxing/demuxing when converting between .ts / .m4a?
This extra ramp-up doesn't account for the click at the end. Is the concatenation also introducing artifacts into these .ts files?

Thanks to anyone who reads this.

EDIT: I should add that these artifacts are present no matter what segment times / audio files are used as input. The artifacts also appear to be deterministic: they appear at exactly the same times for the same inputs.

Solution

Audio encoders use something called a "bit reservoir". This means individual frames cannot be freely concatenated, as a frame may require bits from the reservoir carried in another frame. To bootstrap the process, some codecs use something called priming samples. These samples compose a dummy frame to prime the bit reservoir. Long story short, seamlessly concatenating audio means that the previous frame and the next frame MUST agree on the status and contents of the reservoir. There are techniques to address this, but they ALL require re-encoding.

The gaps in your example are the lost bits in the reservoir you cut off and/or the priming samples of the new encode.
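
To get a feel for the sizes involved, here is a rough back-of-the-envelope sketch in shell arithmetic. It rests on assumptions that are not stated in the question: 1024 samples per AAC frame (the usual AAC-LC frame size) and a 44100 Hz sample rate. The function names are hypothetical. Since a stream copy can only cut between whole frames, it shows how far a requested cut at 7 s or 3 s sits from the nearest earlier frame boundary, and how long one frame's worth of samples lasts.

```shell
#!/bin/sh
# Assumption (not from the original post): 1024 samples per AAC-LC frame.
SAMPLES_PER_FRAME=1024

# Whole frames that fit before a cut at $1 seconds, at sample rate $2 Hz.
whole_frames() {
    echo $(( $1 * $2 / $SAMPLES_PER_FRAME ))
}

# Samples between the last whole frame boundary and the requested cut point.
leftover_samples() {
    echo $(( ($1 * $2) % $SAMPLES_PER_FRAME ))
}

# Duration of $1 samples at sample rate $2 Hz, in microseconds (rounded down).
samples_to_us() {
    echo $(( $1 * 1000000 / $2 ))
}

# Example calls, assuming 44100 Hz input.
whole_frames 7 44100
leftover_samples 7 44100
samples_to_us 1024 44100
```

Under those assumptions, a cut requested at 7 s falls 476 samples (about 10.8 ms) past the end of frame 301, and a single frame lasts about 23.2 ms. One frame of priming or one misaligned boundary is therefore on the order of tens of milliseconds, which is the right scale for a short dip or click.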

TLDR. You're SOL.