Why is O_DIRECT slower than a normal read?

Here's the code I'm using:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <malloc.h>

int main (int argc, char* argv[]) {
    int fd;
    int alignment = 4096;
    int bufsize = 4096 * 4096;
    char* buf = (char*) memalign(alignment, bufsize);
    int i, n, result=0;
    const char* fname = "1GB.txt";

    if ((fd = open(fname, O_RDONLY|O_DIRECT)) < 0) {
        /* Two %s conversions need two arguments: program name, then file name. */
        printf("%s: cannot open %s\n", argv[0], fname);
        exit(2);
    }

    while ( (n = read(fd,buf,bufsize)) > 0 )
        for (i=0; i<n; ++i)
            result += buf[i];
    
    printf("Result: %d\n", result);

    return 0;
}

Here's the command I'm running:

echo 1 > /proc/sys/vm/drop_caches
time ./a.out 1GB.txt

Without O_DIRECT, and after flushing the page cache, it takes only 1.1 seconds; with O_DIRECT it takes 2.5 seconds.

I tried changing the alignment and bufsize. Increasing the bufsize to 4096 * 4096 * 4 reduced the running time to 1.79 seconds. Increasing bufsize to 4096 * 4096 * 64 reduced running time to 1.75 seconds. Reducing the alignment to 512 reduced the running time to 1.72 seconds. I don't know what else to try.

I don't understand why using O_DIRECT makes the code slower. Could it be due to the fact that I'm using disk encryption?

I'm on Debian 12, kernel 6.1.0-9-amd64.

UPDATE: Follow up: Why O_DIRECT is slower than plain read() even with read-ahead?

Solution

I think Linus summarizes O_DIRECT pretty well in this old mailing list thread, where someone was experiencing the same problem you are:

On Fri, 10 May 2002, Lincoln Dale wrote:

so O_DIRECT in 2.4.18 still shows up as a 55% performance hit versus no O_DIRECT. anyone have any clues?

Yes.

O_DIRECT isn't doing any read-ahead.

For O_DIRECT to be a win, you need to make it asynchronous.

The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances [*].

It's simply not very pretty, and it doesn't perform very well either because of the bad interfaces (where synchronicity of read/write is part of it, but the inherent page-table-walking is another issue).

I bet you could get better performance more cleanly by splitting up the actual IO generation and the "user-space mapping" thing sanely.

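So your reads are slower because, with O_DIRECT, the kernel performs no read-ahead and no caching, both of which are normal behavior for reads without O_DIRECT. Each read() in your loop blocks until the device has delivered the data, and the device then sits idle while you sum the buffer.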

If you read in chunks, then short of requesting much larger reads, you can only really benefit from O_DIRECT by issuing the reads asynchronously, for example with io_uring, so that the next chunk is being fetched while you process the current one. Linus suggests other interesting approaches in the mailing list thread linked above.
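To make the "keep a read in flight while you process" idea concrete, here is a minimal sketch of double buffering. Note that it uses POSIX AIO (aio_read) rather than io_uring, only because it needs no extra library; glibc implements POSIX AIO with user-space threads, so treat it as an illustration of the pattern, not as the fastest option. The names sum_file_overlapped and queue_read are made up for this example, and the sketch assumes a regular file, where a short read means end of file.

```c
#define _GNU_SOURCE   /* lets callers pass O_DIRECT from <fcntl.h> */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Queue one asynchronous read of `len` bytes at file offset `off` into `dst`. */
static int queue_read(struct aiocb *cb, int fd, void *dst, size_t len, off_t off)
{
    memset(cb, 0, sizeof *cb);
    cb->aio_fildes = fd;
    cb->aio_buf = dst;
    cb->aio_nbytes = len;
    cb->aio_offset = off;
    return aio_read(cb);
}

/* Hypothetical helper: sums every byte of `path` while the next chunk is
 * already being read. `extra_flags` is OR-ed into open(), so pass O_DIRECT
 * to bypass the page cache; `bufsize` should then be a multiple of 4096.
 * Returns 0 on success and stores the sum in *sum_out, or -1 on error. */
int sum_file_overlapped(const char *path, int extra_flags, size_t bufsize,
                        long long *sum_out)
{
    void *mem[2] = { NULL, NULL };
    struct aiocb cb;
    const struct aiocb *const list[1] = { &cb };
    long long sum = 0;
    off_t offset = 0;
    int cur = 0, rc = -1;

    if (bufsize == 0)
        return -1;
    int fd = open(path, O_RDONLY | extra_flags);
    if (fd < 0)
        return -1;
    /* Two 4096-byte-aligned buffers: one being filled, one being summed. */
    if (posix_memalign(&mem[0], 4096, bufsize) != 0 ||
        posix_memalign(&mem[1], 4096, bufsize) != 0)
        goto out;
    if (queue_read(&cb, fd, mem[cur], bufsize, offset) != 0)
        goto out;

    for (;;) {
        /* Wait for the read that is currently in flight. */
        while (aio_error(&cb) == EINPROGRESS)
            aio_suspend(list, 1, NULL);
        if (aio_error(&cb) != 0)
            goto out;
        ssize_t n = aio_return(&cb);
        if (n < 0)
            goto out;

        const char *done = mem[cur];
        /* Assumption: a short read means end of file. Stopping here also
         * avoids issuing an O_DIRECT read at an unaligned offset. */
        int last = (size_t)n < bufsize;
        if (!last) {
            /* Start the next read into the other buffer before summing. */
            offset += n;
            cur ^= 1;
            if (queue_read(&cb, fd, mem[cur], bufsize, offset) != 0)
                goto out;
        }
        for (ssize_t i = 0; i < n; i++)
            sum += done[i];
        if (last)
            break;
    }
    *sum_out = sum;
    rc = 0;
out:
    free(mem[0]);
    free(mem[1]);
    close(fd);
    return rc;
}
```

You would call it as sum_file_overlapped("1GB.txt", O_DIRECT, 4096 * 4096, &result). With glibc older than 2.34 you may need to link with -lrt. Whether this closes the gap on your machine is something you would have to measure; an io_uring version with several reads queued at once follows the same structure.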