Looking to understand RTSP and H.264 Encapsulation

I am trying to learn enough about H.264, RTP, RTSP and encapsulation file formats to develop a video recording application.

Specifically, what should I read to understand the problem?

I want to be able to answer the following questions:

Can I save H.264 packets or NALs (Per RFC 6184) to a file?
Can I save the individual payloads as files?
Can I join the RTP payloads simply by concatenating them?
What transformation is needed to save several seconds of H.264 video in an MP4 container?
What must be done to later join these MP4 files, or arbitrarily split them, or serve them as a new RTSP presentation?

I want to be able to answer these questions on a fairly low level so I can implement software that does some of the processes (capture RTP streams, rebroadcast joined MP4s).

Background

The goal is to record video from a network camera onto disk. The camera has an RTSP server that provides an H.264 encoded stream which it sends via RTP to a player. I have successfully played the stream using VLC, but would like to customize the process.

Solution

The "raw" video stream is a sequence of NAL units, as defined by the H.264 specification. Neither an RTSP/RTP session nor an MP4 file carries this stream "as is".
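To make "a sequence of NAL units" a bit more concrete, here is a minimal sketch of reading the one-byte NAL unit header. The function name `parse_nal_header` and the `NAL_TYPE_NAMES` table are my own illustration, not part of any library, and the table lists only a few common types.

```python
# Illustrative subset of H.264 nal_unit_type values (not exhaustive).
NAL_TYPE_NAMES = {
    1: "non-IDR slice",
    5: "IDR slice",
    6: "SEI",
    7: "SPS",
    8: "PPS",
    9: "access unit delimiter",
}


def parse_nal_header(nal: bytes) -> dict:
    """Decode the first byte of a NAL unit (hypothetical helper)."""
    first = nal[0]
    nal_type = first & 0x1F  # low 5 bits
    return {
        "forbidden_zero_bit": first >> 7,    # top bit, must be 0
        "nal_ref_idc": (first >> 5) & 0x03,  # next 2 bits
        "nal_unit_type": nal_type,
        "name": NAL_TYPE_NAMES.get(nal_type, "other"),
    }
```

For example, a NAL unit starting with byte `0x67` decodes as type 7 (SPS), and one starting with `0x65` as type 5 (IDR slice). A recording generally needs the SPS and PPS units before the first IDR slice to be decodable.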

Over an RTSP session, the RTP packets typically carry NAL units in fragmented (or aggregated) form, so you need to depacketize them as described in RFC 6184 (no, you cannot simply concatenate the payloads).
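As a rough sketch of what depacketizing involves, the class below handles the three most common RFC 6184 payload layouts: single NAL unit packets (types 1 to 23), STAP-A aggregation (type 24) and FU-A fragmentation (type 28). The class name `H264Depacketizer` and the helper `to_annex_b` are my own. It assumes the RTP header has already been stripped and that packets arrive in order without loss; a real implementation would also have to check sequence numbers and handle the other packet types.

```python
import struct
from typing import List, Optional

START_CODE = b"\x00\x00\x00\x01"  # Annex B start code


class H264Depacketizer:
    """Rebuilds NAL units from RFC 6184 RTP payloads (hypothetical sketch)."""

    def __init__(self) -> None:
        self._fragments: Optional[bytearray] = None  # in-progress FU-A unit

    def push(self, payload: bytes) -> List[bytes]:
        """Return the NAL units completed by this RTP payload."""
        if not payload:
            return []
        nal_type = payload[0] & 0x1F
        if 1 <= nal_type <= 23:
            # Single NAL unit packet: the payload is the NAL unit itself.
            return [payload]
        if nal_type == 24:
            # STAP-A: repeated (16-bit size, NAL unit) pairs after one byte.
            nals, offset = [], 1
            while offset + 2 <= len(payload):
                (size,) = struct.unpack_from(">H", payload, offset)
                offset += 2
                nals.append(payload[offset:offset + size])
                offset += size
            return nals
        if nal_type == 28 and len(payload) >= 2:
            # FU-A: FU indicator byte, FU header byte, then a fragment.
            fu_indicator, fu_header = payload[0], payload[1]
            if fu_header & 0x80:  # S bit: first fragment
                # Rebuild the NAL header: F and NRI bits come from the
                # indicator, the NAL type from the FU header.
                header = (fu_indicator & 0xE0) | (fu_header & 0x1F)
                self._fragments = bytearray([header])
            if self._fragments is None:
                return []  # start fragment was missed, drop this one
            self._fragments += payload[2:]
            if fu_header & 0x40:  # E bit: last fragment
                nal = bytes(self._fragments)
                self._fragments = None
                return [nal]
        return []  # incomplete FU-A or unhandled packet type


def to_annex_b(nals: List[bytes]) -> bytes:
    """Prefix each NAL unit with a start code (Annex B byte stream)."""
    return b"".join(START_CODE + nal for nal in nals)
```

Writing `to_annex_b(depacketizer.push(payload))` to a file for each payload gives an Annex B byte stream (often saved with a `.h264` extension), which is the form in which NAL units can be stored back to back.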

An MP4 file is a container format with its own structure (boxes), so you cannot simply write NAL units into such a file; you have to do what is called multiplexing.
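To illustrate the box structure, here is a small sketch that builds and walks top-level boxes: each box starts with a 32-bit big-endian size and a four-character type, a size of 1 means a 64-bit size follows, and a size of 0 means the box runs to the end of the data. The names `make_box`, `iter_boxes` and `length_prefix` are my own, and this is not a muxer: a playable file also needs a `moov` box with sample tables and an `avcC` configuration record, which are left out here.

```python
import struct
from typing import Iterator, Tuple


def make_box(box_type: bytes, payload: bytes) -> bytes:
    """Serialize one box: 4-byte size, 4-byte type, then the payload."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


def iter_boxes(data: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield (type, payload) for each top-level box in data."""
    offset = 0
    while offset + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:
            # A 64-bit "largesize" follows the type field.
            if offset + 16 > len(data):
                break
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header = 16
        elif size == 0:
            # Box extends to the end of the data.
            size = len(data) - offset
        if size < header:
            break  # malformed box, stop instead of looping forever
        payload = data[offset + header:offset + size]
        yield box_type.decode("ascii", "replace"), payload
        offset += size


def length_prefix(nal: bytes) -> bytes:
    """Inside MP4 samples, NAL units carry a length prefix instead of a
    start code; 4 bytes is assumed here (the real width is set in avcC)."""
    return struct.pack(">I", len(nal)) + nal
```

For instance, `make_box(b"mdat", length_prefix(nal))` produces a media data box holding one length-prefixed NAL unit, and running `iter_boxes` over an existing file's bytes lists its top-level boxes such as `ftyp`, `moov` and `mdat`. In practice an existing muxer (for example FFmpeg) takes care of all of this.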

See also: How do I create an mp4 file from a collection of H.264 frames and audio frames?