c++ · linux · io-uring

Why is liburing write performance lower than expected?


Problem Summary

I am working on a project that requires streaming data to disk at very high speeds on a single Linux server. An fio benchmark using the command below shows that I should be able to get the desired write speeds (> 40 GB/s) using io_uring.

fio --name=seqwrite --rw=write --direct=1 --ioengine=io_uring --bs=128k --numjobs=4 --size=100G --runtime=300 --directory=/mnt/md0/ --iodepth=128 --buffered=0 --numa_cpu_nodes=0 --sqthread_poll=1  --hipri=1

However, I am not able to replicate this performance with my own code, which makes use of the liburing helper library for io_uring. My current write speed is about 9 GB/s. I suspect the extra overhead of liburing might be the bottleneck, but I have a few questions about my approach before I give up on the much prettier liburing code.

My approach

  • Using liburing
  • Utilizing the submission queue polling feature
  • NOT queueing gather/scatter I/O requests with writev(), but rather queueing plain write() requests to write to disk. (I tried gather/scatter requests, but they did not have a major impact on my write speeds.)
  • Multithreaded with one ring per thread

Additional Information

  • Running a simplified version of this code that makes no use of threading yielded similar results.
  • My debugger shows that I am creating the number of threads specified in the NUM_JOBS macro. However, it does not tell me about threads that are created by the kernel for sq polling.
  • My performance declines when running more than two threads
  • The linux server has 96 CPUs to work with
  • The data is being written to a RAID0 configuration
  • I am using bpftrace -e 'tracepoint:io_uring:io_uring_submit_sqe {printf("%s(%d)\n", comm, pid);}' in a separate terminal, which shows that the kernel thread(s) dedicated to sq polling are active.
  • I have verified that the data written to disk exactly matches what I expect it to be in size and contents.
  • I have tried using the IORING_SETUP_ATTACH_WQ flag when setting up the rings. If anything, this slowed things down.
  • I have tried various block sizes, 128k seems to be the sweet spot

Questions

  1. I expect that the kernel would spin up a single thread per ring to handle sq polling. However, I do not know how to verify this is actually happening. Can I assume that it is?
  2. Why does my performance decrease when running more than two jobs? Is this due to contention between the threads for the file being written to? Maybe it is because there is actually only a single thread working on sq polling that might get bogged down handling requests from multiple rings?
  3. Are there other flags or options I should be using that might help?
  4. Is it time to bite the bullet and use direct io_uring calls?

The Code

The code below is a simplified version that removes a lot of error handling code for the sake of brevity. However, the performance and function of this simplified version are the same as those of the full-featured code.

The main function

#include <fcntl.h>
#include <liburing.h>
#include <cstring>
#include <thread>
#include <vector>
#include "utilities.h"

#define NUM_JOBS 4 // number of single-ring threads
#define QUEUE_DEPTH 128 // size of each ring
#define IO_BLOCK_SIZE (128 * 1024) // write block size (parenthesized so it expands safely)
#define WRITE_SIZE (IO_BLOCK_SIZE * 10000) // Total number of bytes to write
#define FILENAME  "/mnt/md0/test.txt" // File to write to

char incomingData[WRITE_SIZE]; // Will contain the data to write to disk

void writeToFile(int fd, io_uring* ring, char* buffer, int size, int fileIndex); // defined below

int main() 
{
    // Initialize variables
    std::vector<std::thread> threadPool;
    std::vector<io_uring*> ringPool;
    io_uring_params params;
    int fds[2];

    int bytesPerThread = WRITE_SIZE / NUM_JOBS;
    int bytesRemaining = WRITE_SIZE % NUM_JOBS;
    int bytesAssigned = 0;
    
    utils::generate_data(incomingData, WRITE_SIZE); // this just fills the incomingData buffer with known data

    // Open the file, store its descriptor
    fds[0] = open(FILENAME, O_WRONLY | O_TRUNC | O_CREAT, 0644); // O_CREAT requires a mode argument
    
    // initialize Rings
    ringPool.resize(NUM_JOBS);
    for (int i = 0; i < NUM_JOBS; i++)
    {
        io_uring* ring = new io_uring;

        // Configure the io_uring parameters and init the ring
        memset(&params, 0, sizeof(params));
        params.flags |= IORING_SETUP_SQPOLL;
        params.sq_thread_idle = 2000;
        io_uring_queue_init_params(QUEUE_DEPTH, ring, &params);
        io_uring_register_files(ring, fds, 1); // required for sq polling

        // Add the ring to the pool
        ringPool.at(i) = ring;
    }
    
    // Spin up threads to write to the file
    threadPool.resize(NUM_JOBS);
    for (int i = 0; i < NUM_JOBS; i++)
    {
        int bytesToAssign = (i != NUM_JOBS - 1) ? bytesPerThread : bytesPerThread + bytesRemaining;
        threadPool.at(i) = std::thread(writeToFile, 0, ringPool[i], incomingData + bytesAssigned, bytesToAssign, bytesAssigned);
        bytesAssigned += bytesToAssign;
    }

    // Wait for the threads to finish
    for (int i = 0; i < NUM_JOBS; i++)
    {
        threadPool[i].join();
    }

    // Cleanup the rings
    for (int i = 0; i < NUM_JOBS; i++)
    {
        io_uring_queue_exit(ringPool[i]);
    }

    // Close the file
    close(fds[0]);

    return 0;
}

The writeToFile() function

void writeToFile(int fd, io_uring* ring, char* buffer, int size, int fileIndex)
{
    io_uring_cqe *cqe;
    io_uring_sqe *sqe;

    int bytesRemaining = size;
    int bytesToWrite;
    int bytesWritten = 0;
    int writesPending = 0;

    while (bytesRemaining || writesPending)
    {
        while(writesPending < QUEUE_DEPTH && bytesRemaining)
        {
            /* In this first inner loop,
             * Write up to QUEUE_DEPTH blocks to the submission queue
             */

            bytesToWrite = bytesRemaining > IO_BLOCK_SIZE ? IO_BLOCK_SIZE : bytesRemaining;
            sqe = io_uring_get_sqe(ring);
            if (!sqe) break; // if can't get a sqe, break out of the loop and wait for the next round
            io_uring_prep_write(sqe, fd, buffer + bytesWritten, bytesToWrite, fileIndex + bytesWritten);
            sqe->flags |= IOSQE_FIXED_FILE;
            
            writesPending++;
            bytesWritten += bytesToWrite;
            bytesRemaining -= bytesToWrite;
            if (bytesRemaining == 0) break;
        }

        io_uring_submit(ring);

        while(writesPending)
        {
            /* In this second inner loop,
             * Handle completions
             * Additional error handling removed for brevity
             * The functionality is the same as with error handling in the case that nothing goes wrong
             */

            int status = io_uring_peek_cqe(ring, &cqe);
            if (status == -EAGAIN) break; // if no completions are available, break out of the loop and wait for the next round
            
            io_uring_cqe_seen(ring, cqe);

            writesPending--;
        }
    }
}

Solution

  • Your fio example is using O_DIRECT; your own code is doing buffered IO. That's quite a big change. Outside of that, you're also doing polled IO with fio, while your example is not. Polled IO would set IORING_SETUP_IOPOLL and ensure that the underlying device has polling configured (see poll_queues=X for nvme). I suspect you end up doing IRQ-driven IO with fio anyway, in case that isn't configured correctly to begin with.

    A few more notes - fio also sets a few optimal flags, like defer taskrun (IORING_SETUP_DEFER_TASKRUN) and single issuer (IORING_SETUP_SINGLE_ISSUER). If the kernel is new enough, that'll make a difference, though nothing crazy for this workload.

    And finally, you're using registered files. This is fine obviously, and is a good optimization if you're reusing a file descriptor. But it's not a requirement for SQPOLL, that went away long ago.

    In summary, the fio job you are running and the code you wrote do vastly different things. Not an apples to apples comparison.

    Edit: the fio job is also 4 threads, each writing to its own file; your example appears to be 4 threads writing to the same file. This will obviously make things worse, particularly since your example is doing buffered IO, so you're just going to end up with a lot of contention on the inode lock because of that.
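    The two setup changes the answer points at can be sketched as below. This is a sketch, not a drop-in fix: the 4096-byte alignment is a typical value (query the device's logical block size to be certain), the fallback flag values are taken from the 6.1 uapi header for older systems, and the path is the asker's. Note that DEFER_TASKRUN requires SINGLE_ISSUER and cannot be combined with SQPOLL, and polled completions may not be supported through the md (RAID0) layer at all.

    ```cpp
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE // for O_DIRECT
    #endif
    #include <fcntl.h>
    #include <unistd.h>
    #include <linux/io_uring.h>
    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>

    // Fallback definitions for older uapi headers; values from the 6.1 kernel.
    #ifndef IORING_SETUP_SINGLE_ISSUER
    #define IORING_SETUP_SINGLE_ISSUER (1U << 12)
    #endif
    #ifndef IORING_SETUP_DEFER_TASKRUN
    #define IORING_SETUP_DEFER_TASKRUN (1U << 13)
    #endif

    int main()
    {
        // O_DIRECT requires the buffer address, file offset, and transfer
        // length to all be multiples of the device's logical block size.
        // 4096 is typical for nvme; query BLKSSZGET to be certain.
        constexpr size_t kAlign = 4096;
        constexpr size_t kBlock = 128 * 1024;

        void* buffer = nullptr;
        if (posix_memalign(&buffer, kAlign, kBlock) != 0)
            return 1;
        printf("buffer aligned: %s\n",
               reinterpret_cast<uintptr_t>(buffer) % kAlign == 0 ? "yes" : "no");

        // Matching the fio job: open with O_DIRECT for unbuffered IO
        // (and pass a mode, which O_CREAT requires).
        int fd = open("/mnt/md0/test.txt",
                      O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
        if (fd >= 0)
            close(fd); // open may fail here if the path doesn't exist; the flags are the point

        // Option A: keep SQPOLL and add completion polling (IOPOLL). IOPOLL
        // only helps if the device has poll queues (poll_queues=N for nvme).
        unsigned sqpollFlags = IORING_SETUP_SQPOLL | IORING_SETUP_IOPOLL;

        // Option B: the newer flags fio sets. DEFER_TASKRUN requires
        // SINGLE_ISSUER and is not combinable with SQPOLL.
        unsigned deferFlags = IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN;

        printf("sqpoll+iopoll flags: 0x%x\n", sqpollFlags);
        printf("single_issuer+defer_taskrun flags: 0x%x\n", deferFlags);

        free(buffer);
        return 0;
    }
    ```

    Either flag set would then be passed via io_uring_params::flags to io_uring_queue_init_params(), as the question's code already does for SQPOLL alone.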