Best practice for buffering data to be sent on UART

I'm working on an embedded project using an STM32F7 device, writing bare-metal C.

I want to be able to send data to a UART at any point in the program for debugging purposes, without blocking while the data is sent. I'm using DMA to try to minimise the CPU time used for this.

Currently I'm filling the data into a FIFO queue, and then initiating a DMA request to send the data directly from the FIFO queue to the UART.

The issue with this is I can't set up the DMA to read from both the start and end of the FIFO buffer, in the case where the middle of the FIFO is unused and a message wraps from the end of the buffer to the start.

One solution would be to set up a first DMA request to read from the head of the FIFO to the end of the buffer and then, once that completes, a second request to read from the start of the buffer to the tail of the FIFO.

The other way to do it would be to memcpy() out the bytes to be sent to another buffer, where they are all sequential, then initiate a single DMA request to send all the data at once.

Both of these would probably work but I'm looking for insight on what the best approach would be here.

Solution

The implementation I've usually chosen is similar to what you have proposed:

The logging function formats a message and adds it to a circular buffer.
DMA is used for the UART transmission. Each DMA transfer is set up to send one contiguous chunk of data, so a transfer never wraps around the end of the buffer.
Whenever a DMA transfer finishes, an interrupt is triggered. The handler first frees the transmitted data in the circular buffer, then checks whether more data is waiting. If so, it immediately starts the next transfer.

Pseudo code:

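static volatile size_t tx_len = 0; // bytes in the DMA transfer in flight; 0 when idle

void start_tx(void);

void log_message(const char* msg)
{
    circ_buf_add(msg);
    start_tx();
}

// Runs from log_message() and from the DMA interrupt. If log_message() can be
// called from more than one context, wrap this in a critical section.
void start_tx(void)
{
    if (tx_len > 0)
        return; // already transmitting

    const char* start;
    size_t len;
    circ_buf_get_chunk(&start, &len); // contiguous chunk only, never wraps
    if (len == 0)
        return; // nothing to send

    tx_len = len;
    uart_tx_dma(start, tx_len);
}

void dma_interrupt_handler(void)
{
    circ_buf_remove(tx_len); // free the bytes that were just sent
    tx_len = 0;
    start_tx(); // send the next chunk, if any
}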
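
The pseudo code leaves the circular buffer itself undefined. The following is a minimal sketch of one way the three helpers could look, not a definitive implementation. The function names come from the pseudo code; everything else (the 256-byte capacity, the `head`/`tail` indices, `circ_buf_used()`, using `size_t` for lengths, and silently dropping text that does not fit) is an assumption made for illustration. The point to note is `circ_buf_get_chunk()`, which only ever returns the run of bytes up to the end of the array, so a wrapped message simply goes out as two consecutive DMA transfers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CIRC_BUF_SIZE 256u /* assumed capacity; one slot always stays empty */

static char buf[CIRC_BUF_SIZE];
static size_t head = 0; /* read index: next byte to transmit */
static size_t tail = 0; /* write index: next free slot */

/* Number of bytes currently queued for transmission. */
size_t circ_buf_used(void)
{
    return (tail + CIRC_BUF_SIZE - head) % CIRC_BUF_SIZE;
}

/* Append a string; returns the number of bytes stored.
 * Assumption: text that does not fit is dropped rather than blocking. */
size_t circ_buf_add(const char *msg)
{
    size_t n = strlen(msg);
    size_t free_space = CIRC_BUF_SIZE - 1u - circ_buf_used();
    if (n > free_space)
        n = free_space;
    for (size_t i = 0; i < n; i++) {
        buf[tail] = msg[i];
        tail = (tail + 1u) % CIRC_BUF_SIZE;
    }
    return n;
}

/* Largest contiguous run starting at head. It stops at the end of the
 * array, so the DMA never has to wrap. */
void circ_buf_get_chunk(const char **start, size_t *len)
{
    *start = &buf[head];
    *len = (tail >= head) ? (tail - head) : (CIRC_BUF_SIZE - head);
}

/* Release len bytes after the DMA has finished sending them. */
void circ_buf_remove(size_t len)
{
    assert(len <= circ_buf_used());
    head = (head + len) % CIRC_BUF_SIZE;
}
```

In this sketch only the writer modifies `tail` and only the DMA interrupt modifies `head`, which is what keeps the layout workable with a single logging context. On a real target the indices would likely need to be `volatile` or otherwise protected, and the buffer must live in memory the DMA controller can read.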

It usually makes sense to limit the length of the transmitted chunk. The shorter it is, the sooner space is freed up in the circular buffer.
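
As a rough illustration of that limit, the length of the next transfer can be computed as the contiguous run from the read index, clamped to a maximum. This is only a sketch: `next_chunk_len()` and `MAX_TX_CHUNK` are hypothetical names, and the value 32 is an arbitrary assumption to be tuned for the baud rate and buffer size in use.

```c
#include <stddef.h>

#define MAX_TX_CHUNK 32u /* assumed cap on a single DMA transfer, in bytes */

/* Length of the next DMA transfer for a circular buffer of `size` bytes,
 * where `head` is the read index and `tail` is the write index.
 * The result never crosses the end of the buffer and never exceeds
 * MAX_TX_CHUNK. */
size_t next_chunk_len(size_t head, size_t tail, size_t size)
{
    size_t contiguous = (tail >= head) ? (tail - head) : (size - head);
    return (contiguous > MAX_TX_CHUNK) ? MAX_TX_CHUNK : contiguous;
}
```

For example, with 100 bytes queued from index 0 this yields 32, so the first 32 bytes are freed after one short transfer instead of after all 100. The cost is one DMA-complete interrupt per chunk, so a very small limit gives back some of the CPU time that DMA was meant to save.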