python | linux | file-io | nvme

NVMe Throughput Testing with Python


Currently I need to do some throughput testing. My hardware setup is a Samsung 950 Pro connected to an NVMe controller that is hooked to the motherboard via a PCIe port. There is a Linux nvme device corresponding to the drive, which I have mounted at a location on the filesystem.

My hope was to use Python for this. The plan was to open a file on the filesystem where the SSD is mounted, record the time, write a stream of n bytes to the file, record the time again, then close the file, all using the os module's file-operation utilities. Here is the function I use to gauge write throughput.

import os
import time

def perform_timed_write(num_bytes, blocksize, fd):
    """
    This function writes to file and records the time

    The function has three steps. The first is to write, the second is to
    record time, and the third is to calculate the rate.

    Parameters
    ----------
    num_bytes: int
        total number of bytes to write to the file
    blocksize: int
        number of bytes written per call to os.write
    fd: str
        path on the filesystem to write to

    Returns
    -------
    bytes_per_second: float
        rate of transfer
    """
    # generate random string
    random_byte_string = os.urandom(blocksize)

    # open the file
    write_file = os.open(fd, os.O_CREAT | os.O_WRONLY | os.O_NONBLOCK)        
    # set time, write, record time
    bytes_written = 0
    before_write = time.perf_counter()
    while bytes_written < num_bytes:
        # os.write may write fewer bytes than requested, so count its return value
        bytes_written += os.write(write_file, random_byte_string)
    after_write = time.perf_counter()

    #close the file
    os.close(write_file)

    # calculate elapsed time
    elapsed_time = after_write - before_write

    # calculate bytes per second
    bytes_per_second = num_bytes / elapsed_time


    return bytes_per_second
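
For reference, a minimal way to call this function looks roughly like the following (the file name and exact sizes are just placeholders for what I actually use):

# write 1 GiB in 4 KiB chunks to a file on the mounted SSD
# ("throughput_test.bin" is a placeholder name)
rate = perform_timed_write(1024 ** 3, 4096, "/fsmnt/fs1/throughput_test.bin")
print("write throughput: %.2f MB/s" % (rate / 1e6))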

My other method of testing is to use the Linux fio utility: https://linux.die.net/man/1/fio

After mounting the SSD at /fsmnt/fs1, I used this jobfile to test the throughput

;Write to 1 file on partition
[global]
ioengine=libaio
buffered=0
rw=write
bs=4k
size=1g
openfiles=1

[file1]
directory=/fsmnt/fs1

I noticed that the write speed reported by the Python function is significantly higher than that of fio. Because Python is so high-level, you give up a lot of control, so I am wondering whether Python is doing something under the hood to inflate its speeds. Does anyone know why Python would report write speeds so much higher than those reported by fio?


Solution

  • The reason your Python program does better than your fio job is that this is not a fair comparison and they are testing different things:

    • You banned fio from using Linux's buffer cache by telling it to do O_DIRECT operations (buffered=0 is the same as saying direct=1). With the job you specified, fio has to send down a single 4k write and then wait for that write to complete at the device (and the acknowledgement has to get all the way back to fio) before it can send the next.

    • Your Python script is allowed to send down writes that can be buffered at multiple levels (e.g. within userspace by the C library and then again in the kernel's buffer cache) before touching your SSD. This generally means the writes will be accumulated and merged together before being sent down to the lower level, resulting in chunkier I/Os that have less overhead. Further, since you don't do any explicit flushing, in theory no I/O has to be sent to the disk before your program exits (in practice this will depend on a number of factors, such as how much I/O you do, the amount of RAM Linux can set aside for buffers, the maximum time the filesystem will hold dirty data, and how long you do the I/O for)! Your os.close(write_file) will just be turned into an fclose(), whose Linux man page says this:

      Note that fclose() flushes only the user-space buffers provided by the C library. To ensure that the data is physically stored on disk the kernel buffers must be flushed too, for example, with sync(2) or fsync(2).

      In fact, you take your final time before calling os.close(), so you may even be omitting the time it took for the final "batches" of data to be sent even as far as the kernel, let alone the SSD!
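
      As a rough sketch (this is not your exact code, and the path is just a placeholder), a variant that flushes before stopping the clock would look something like this:

      import os
      import time

      def timed_write_with_flush(path, num_bytes, blocksize):
          """Like perform_timed_write, but flush to the device before stopping the clock."""
          data = os.urandom(blocksize)
          fd = os.open(path, os.O_CREAT | os.O_WRONLY)
          written = 0
          start = time.perf_counter()
          while written < num_bytes:
              written += os.write(fd, data)
          os.fsync(fd)  # force dirty pages in the kernel's buffer cache out to the device
          elapsed = time.perf_counter() - start
          os.close(fd)
          return written / elapsed

      With the fsync() inside the timed region, the reported number should drop much closer to what a buffered fio sequential write job shows.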

    Your Python script is closer to this fio job:

    [global]
    ioengine=psync
    rw=write
    bs=4k
    size=1g
    
    [file1]
    filename=/fsmnt/fio.tmp
    

    Even with this, fio is still at a disadvantage because your Python program has userspace buffering (so bs=8k may be closer).

    The key takeaway is that your Python program is not really testing your SSD's speed at your specified block size, and your original fio job is a bit weird, heavily restricted (the libaio ioengine is asynchronous, but with a depth of 1 you're not going to be able to benefit from that, and that's before we get to the behaviour of Linux AIO when using filesystems) and does different things from your Python program. If you're not doing significantly more buffered I/O than the size of the largest buffer (and on Linux the kernel's buffer cache scales with RAM), and if the buffered I/Os are small, the exercise turns into a demonstration of the effectiveness of buffering.
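
    If you want the Python side to be closer to your original O_DIRECT fio job, one option on Linux is to open the file with os.O_DIRECT and write from a page-aligned buffer. A rough sketch (assuming the device accepts 4096-byte-aligned I/O; the path is a placeholder):

    import mmap
    import os
    import time

    def timed_direct_write(path, num_bytes, blocksize=4096):
        """Time O_DIRECT writes; blocksize must be a multiple of the device's logical block size."""
        # An anonymous mmap is page-aligned, which satisfies O_DIRECT's buffer alignment rules
        buf = mmap.mmap(-1, blocksize)
        buf.write(os.urandom(blocksize))

        fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_DIRECT, 0o644)
        try:
            written = 0
            start = time.perf_counter()
            while written < num_bytes:
                written += os.write(fd, buf)
            elapsed = time.perf_counter() - start
        finally:
            os.close(fd)
        return written / elapsed

    Like your original fio job, this only ever has one 4k write outstanding, so it demonstrates the same queue-depth-1 behaviour rather than the SSD's peak throughput.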