
Why is data corrupt when reading back from a file while it's being written with O_DIRECT?


I have a C++ program that uses the POSIX API to write a file opened with O_DIRECT. Concurrently, another thread is reading back from the same file via a different file descriptor. I've noticed that occasionally the data read back from the file contains all zeroes, rather than the actual data I wrote. Why is this?

Here's an MCVE in C++17. Compile with g++ -std=c++17 -Wall -pthread -otest test.cpp or equivalent (-pthread is needed for std::thread on older glibc versions). Sorry, I couldn't make it any shorter. All it does is write 100 MiB of constant bytes (0x5A) to a file in one thread and read them back in another, printing a message if any of the read-back bytes are not equal to 0x5A.

WARNING, this MCVE will delete and rewrite any file in the current working directory named foo.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <thread>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>

constexpr size_t CHUNK_SIZE = 1024 * 1024;
constexpr size_t TOTAL_SIZE = 100 * CHUNK_SIZE;

int main(int argc, char *argv[])
{
    ::unlink("foo");

    std::thread write_thread([]()
    {
        int fd = ::open("foo", O_WRONLY | O_CREAT | O_DIRECT, 0777);
        if (fd < 0) std::exit(-1);

        uint8_t *buffer = static_cast<uint8_t *>(
            std::aligned_alloc(4096, CHUNK_SIZE));

        std::fill(buffer, buffer + CHUNK_SIZE, 0x5A);

        size_t written = 0;
        while (written < TOTAL_SIZE)
        {
            ssize_t rv = ::write(fd, buffer,
                std::min(TOTAL_SIZE - written, CHUNK_SIZE));
            if (rv < 0) { std::cerr << "write error" << std::endl; std::exit(-1); }
            written += rv;
        }
    });

    std::thread read_thread([]()
    {
        int fd = ::open("foo", O_RDONLY, 0);
        if (fd < 0) std::exit(-1);

        uint8_t *buffer = new uint8_t[CHUNK_SIZE];

        size_t checked = 0;
        while (checked < TOTAL_SIZE)
        {
            ssize_t rv = ::read(fd, buffer, CHUNK_SIZE);
            if (rv < 0) { std::cerr << "read error" << std::endl; std::exit(-1); }

            for (ssize_t i = 0; i < rv; ++i)
                if (buffer[i] != 0x5A)
                    std::cerr << "readback mismatch at offset " << checked + i << std::endl;

            checked += rv;
        }
    });

    write_thread.join();
    read_thread.join();
}

(Details such as proper error checking and resource management are omitted here for the sake of the MCVE. This is not my actual program but it shows the same behavior.)

I'm testing on Linux 4.15.0 with an SSD. In about a third of runs the "readback mismatch" message prints; in the rest it doesn't. In all cases, if I examine foo after the fact, I find that it contains the correct data.

If you remove O_DIRECT from the ::open() flags in the write thread, the problem goes away and the "readback mismatch" message never prints.

I could understand why my ::read() might return 0 or something to indicate I've already read everything that has been flushed to disk so far. But I can't understand why it would perform what appears to be a successful read, but with data other than what I wrote. Clearly I'm missing something, but what is it?


Solution

  • O_DIRECT comes with additional constraints that may make it the wrong tool for what you're doing. From the open(2) man page:

    Applications should avoid mixing O_DIRECT and normal I/O to the same file, and especially to overlapping byte regions in the same file. Even when the filesystem correctly handles the coherency issues in this situation, overall I/O throughput is likely to be slower than using either mode alone.

    Instead, I think O_SYNC might be a better fit, since it does provide the guarantees you expect. From the same man page:

    O_SYNC provides synchronized I/O file integrity completion, meaning write operations will flush data and all associated metadata to the underlying hardware. O_DSYNC provides synchronized I/O data integrity completion, meaning write operations will flush data to the underlying hardware, but will only flush metadata updates that are required to allow a subsequent read operation to complete successfully. Data integrity completion can reduce the number of disk operations that are required for applications that don't need the guarantees of file integrity completion.
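
As a rough illustration of that suggestion, here is a minimal sketch of the writer opened with O_SYNC instead of O_DIRECT, plus a plain buffered reader. It assumes the goal is that each write() reaches the device before returning, not that the page cache is bypassed. The helper names write_synced and count_mismatches are made up for this example and are not from the question. Because both sides use ordinary buffered I/O, the warning about mixing O_DIRECT with normal I/O no longer applies.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: write `total` bytes of `value` to `path` in pieces of
// at most `chunk` bytes. O_SYNC makes each write() return only after the data
// and associated metadata have been flushed. Returns true on success.
bool write_synced(const char *path, std::size_t total, std::size_t chunk,
                  std::uint8_t value)
{
    if (chunk == 0) return false;  // a zero-length chunk would never make progress

    int fd = ::open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);
    if (fd < 0) return false;

    // Without O_DIRECT the buffer needs no special alignment.
    std::vector<std::uint8_t> buffer(chunk, value);

    std::size_t written = 0;
    while (written < total)
    {
        ssize_t rv = ::write(fd, buffer.data(), std::min(total - written, chunk));
        if (rv < 0) { ::close(fd); return false; }
        written += static_cast<std::size_t>(rv);  // short writes are retried by the loop
    }
    return ::close(fd) == 0;
}

// Hypothetical helper: read `path` to end-of-file with normal buffered I/O and
// count the bytes that differ from `value`. Returns -1 on error.
long count_mismatches(const char *path, std::uint8_t value, std::size_t chunk)
{
    if (chunk == 0) return -1;

    int fd = ::open(path, O_RDONLY);
    if (fd < 0) return -1;

    std::vector<std::uint8_t> buffer(chunk);
    long mismatches = 0;
    for (;;)
    {
        ssize_t rv = ::read(fd, buffer.data(), buffer.size());
        if (rv < 0) { ::close(fd); return -1; }
        if (rv == 0) break;  // end of file
        mismatches += std::count_if(buffer.begin(), buffer.begin() + rv,
            [value](std::uint8_t b) { return b != value; });
    }
    ::close(fd);
    return mismatches;
}
```

In the MCVE, the write thread would call something like write_synced("foo", TOTAL_SIZE, CHUNK_SIZE, 0x5A). The read thread would keep its existing polling loop, since a reader that catches up with the writer simply gets a 0 return from ::read() and has to retry. Expect O_SYNC writes to be noticeably slower than unsynchronized buffered writes; O_DSYNC is the lighter option if only data integrity is needed.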