c++ undefined-behavior reinterpret-cast memory-mapping

Dealing with undefined behavior when using reinterpret_cast in a memory mapping


To avoid copying large amounts of data, it is desirable to mmap a binary file and process the raw data directly. This approach has several advantages, including delegating the paging to the operating system. Unfortunately, it is my understanding that the obvious implementation leads to Undefined Behavior (UB).

My use case is as follows: Create a binary file that contains some header identifying the format and providing metadata (in this case simply the number of double values). The remainder of the file contains raw binary values which I wish to process without having to first copy the file into a local buffer (that's why I'm memory-mapping the file in the first place). The program below is a full (if simple) example (I believe that all places marked as UB[X] lead to UB):

// C++ Standard Library
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <iterator>
#include <numeric>
#include <stdexcept>

// POSIX Library (for mmap)
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

constexpr char MAGIC[8] = {"1234567"};

struct Header {
  char          magic[sizeof(MAGIC)] = {'\0'};
  std::uint64_t size                 = {0};
};
static_assert(sizeof(Header) == 16, "Header size should be 16 bytes");
static_assert(alignof(Header) == 8, "Header alignment should be 8 bytes");

void write_binary_data(const char* filename) {
  Header header;
  std::copy_n(MAGIC, sizeof(MAGIC), header.magic);
  header.size = 100u;

  std::ofstream fp(filename, std::ios::out | std::ios::binary);
  fp.write(reinterpret_cast<const char*>(&header), sizeof(Header));
  for (auto k = 0u; k < header.size; ++k) {
    double value = static_cast<double>(k);
    fp.write(reinterpret_cast<const char*>(&value), sizeof(double));
  }
}

double read_binary_data(const char* filename) {
  // POSIX mmap API
  auto        fp = ::open(filename, O_RDONLY);
  struct stat sb;
  ::fstat(fp, &sb);
  auto data = static_cast<char*>(
      ::mmap(nullptr, sb.st_size, PROT_READ, MAP_PRIVATE, fp, 0));
  ::close(fp);
  // end of POSIX mmap API (all error handling omitted)

  // UB1
  const auto header = reinterpret_cast<const Header*>(data);

  // UB2
  if (!std::equal(MAGIC, MAGIC + sizeof(MAGIC), header->magic)) {
    throw std::runtime_error("Magic word mismatch");
  }

  // UB3
  auto beg = reinterpret_cast<const double*>(data + sizeof(Header));

  // UB4
  auto end = std::next(beg, header->size);

  // UB5
  auto sum = std::accumulate(beg, end, double{0});

  ::munmap(data, sb.st_size);

  return sum;
}

int main() {
  const double expected = 4950.0;
  write_binary_data("test-data.bin");

  if (auto sum = read_binary_data("test-data.bin"); sum == expected) {
    std::cout << "as expected, sum is: " << sum << "\n";
  } else {
    std::cout << "error\n";
  }
}

Compile and run as:

$ clang++ example.cpp -std=c++17 -Wall -Wextra -O3 -march=native
$ ./a.out
as expected, sum is: 4950

In real life, the actual binary format is much more complicated but retains the same properties: Fundamental types stored in a binary file with proper alignment.
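The alignment property can be checked at run time before any typed access is attempted. The following is a minimal sketch, not part of the program above: `is_aligned_for` and `offset_is_aligned` are hypothetical helper names, and the test relies on the conversion from pointer to `std::uintptr_t` preserving the numeric address, which is implementation-defined but holds on mainstream platforms.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when `address` is suitably aligned for a T.
// Assumes the pointer-to-integer conversion yields the numeric address.
template <class T>
bool is_aligned_for(const void* address) {
  return reinterpret_cast<std::uintptr_t>(address) % alignof(T) == 0;
}

// Hypothetical helper: true when a T stored `offset` bytes into a mapping
// that starts at `base` would be suitably aligned.
template <class T>
bool offset_is_aligned(const char* base, std::size_t offset) {
  assert(base != nullptr);
  return is_aligned_for<T>(base + offset);
}
```

In the example program, `mmap` returns a page-aligned address and `sizeof(Header)` is 16, so a call such as `offset_is_aligned<double>(data, sizeof(Header))` would be expected to return true. Passing this check says nothing about aliasing or object lifetime; it only rules out misaligned access.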

My question is: how do you deal with this use case?

I have found many answers that I perceive as conflicting.

Some answers state unequivocally that one should build the objects locally. This may very well be the case but severely complicates any array-oriented operations.

Comments elsewhere seem to agree on the UB nature of this construct but there are some disagreements.

The wording in cppreference is, at least to me, confusing. I would have interpreted it as "what I'm doing is perfectly legal". Specifically this paragraph:

Whenever an attempt is made to read or modify the stored value of an object of type DynamicType through a glvalue of type AliasedType, the behavior is undefined unless one of the following is true:

  • AliasedType and DynamicType are similar.
  • AliasedType is the (possibly cv-qualified) signed or unsigned variant of DynamicType.
  • AliasedType is std::byte (since C++17), char, or unsigned char: this permits examination of the object representation of any object as an array of bytes.
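One reading of this list is that the byte-type exception only runs in one direction: an existing object may be examined through `char`, `unsigned char`, or `std::byte`, but a buffer of bytes may not be read through a `double` glvalue. The sketch below illustrates that reading; `byte_of` and `double_from_bytes` are hypothetical names introduced for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Permitted by the third bullet: examine the object representation of an
// existing double through unsigned char.
inline unsigned char byte_of(const double& value, std::size_t index) {
  assert(index < sizeof(double));
  const auto* bytes = reinterpret_cast<const unsigned char*>(&value);
  return bytes[index];
}

// The reverse direction is not covered by the list: here the bytes are
// copied into a real double object instead of being read through a
// reinterpret_cast<const double*>.
inline double double_from_bytes(const char* bytes) {
  double value;
  std::memcpy(&value, bytes, sizeof value);
  return value;
}
```

Under this reading, the casts in the example program are the second, unsupported direction, which is why they are marked as UB.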

It may be that C++17 offers some hope with std::launder or that I'll have to wait until C++20 for something along the lines of std::bit_cast.

In the meantime, how do you deal with this issue?

Link to on-line demo: https://onlinegdb.com/rk_xnlRUV

Simplified example in C

Is my understanding correct that the following C program does not exhibit Undefined Behavior? I understand that pointer casting through a char buffer does not participate in the strict aliasing rules.

#include <stdint.h>
#include <stdio.h>

struct Header {
  char     magic[8];
  uint64_t size;
};

static void process(const char* buffer) {
  const struct Header* h = (const struct Header*)(buffer);
  printf("reading %llu values from buffer\n", (unsigned long long)h->size);
}

int main(int argc, char* argv[]) {
  if (argc != 2) {
    return 1;
  }
  // In practice, I'd pass the buffer through mmap
  FILE* fp = fopen(argv[1], "rb");
  char  buffer[sizeof(struct Header)];
  fread(buffer, sizeof(struct Header), 1, fp);
  fclose(fp);
  process(buffer);
}

I can compile this C code and run it on the file created by the original C++ program, and it works as expected:

$ clang struct.c -std=c11 -Wall -Wextra -O3 -march=native
$ ./a.out test-data.bin 
reading 100 values from buffer

Solution

  • std::launder solves the problem with strict aliasing, but not with object lifetime.

    std::bit_cast makes a copy (it's basically a wrapper for std::memcpy) and doesn't work with copying from a range of bytes.

    There is no tool in standard C++ to reinterpret mapped memory without copying. Such a tool has been proposed: std::bless. Until/unless such changes are adopted into the standard, you'll have to either hope that UB doesn't break anything, take the potential†† performance hit and copy, or write the program in C.

    While not ideal, this is not necessarily as bad as it sounds. You're already restricting portability by using mmap, and if your target system / compiler promises that it is OK to reinterpret mmapped memory (perhaps with laundering), then there should be no problem. That said, I don't know whether, say, GCC on Linux gives such a guarantee.

    †† The compiler may optimise std::memcpy away, so there might not be any performance hit involved. There's a handy function in this SO answer which was observed to be optimised away, but does initiate object lifetime following the language rules. It does have a limitation: the mapped memory must be writable (as it creates objects in the memory, and in a non-optimised build it might do an actual copy).
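The "copy" option can be sketched as follows. This is a minimal illustration, not the function from the linked answer: `load_header` and `sum_values` are hypothetical names, the `Header` layout is taken from the question, and the bounds checks are an addition. Every typed value is copied out of the byte range with `std::memcpy`, so no `Header` or `double` is ever accessed in the mapping itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>

struct Header {
  char          magic[8];
  std::uint64_t size;
};

// Copy a Header out of the first bytes of a raw byte range.
inline Header load_header(const char* data, std::size_t length) {
  assert(data != nullptr);
  if (length < sizeof(Header)) {
    throw std::runtime_error("file too short for header");
  }
  Header header;
  std::memcpy(&header, data, sizeof header);
  return header;
}

// Sum the doubles that follow the header, copying each one into a local
// object before using it.
inline double sum_values(const char* data, std::size_t length) {
  const Header      header    = load_header(data, length);
  const std::size_t available = (length - sizeof(Header)) / sizeof(double);
  if (header.size > available) {
    throw std::runtime_error("file too short for payload");
  }
  const char* cursor = data + sizeof(Header);
  double      sum    = 0.0;
  for (std::uint64_t k = 0; k < header.size; ++k, cursor += sizeof(double)) {
    double value;
    std::memcpy(&value, cursor, sizeof value);
    sum += value;
  }
  return sum;
}
```

With the mapping from the question, this would be called as `sum_values(data, sb.st_size)`. Whether the per-element `std::memcpy` is actually elided depends on the compiler and optimisation level, so it is worth inspecting the generated code before relying on it.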