Safely write RDS files in parallel

I run parallel computations in R across 2000 parallel jobs. Each job uses the following code to save its final results in RDS format:

timestamp <- format(Sys.time(), "%Y%m%d%H%M%S")
timestamp <- paste0(timestamp, sample.int(1000, 1))
filename <- paste0("differences_", timestamp, ".rds")
saveRDS(differences, file = filename)

I realized that even though I append a random number (up to four digits) to the timestamp, the parallel jobs might still conflict with each other. A possible solution would be to check whether a file with the same name already exists and, if so, generate a new name:

while (file.exists(filename)) {
  timestamp <- format(Sys.time(), "%Y%m%d%H%M%S")
  timestamp <- paste0(timestamp, sample.int(1000, 1))
  filename <- paste0("differences_", timestamp, ".rds")
}

But I am not sure if this is entirely safe. Is there a way to ensure a unique file name for each result from the parallel jobs? I have to keep the files in the same path.

Solution

I'll offer this as a slightly more complex variant of @KonradRudolph's suggestion (pre-sequence the filenames for each parallel task). His approach is far simpler and is probably preferable in most situations.
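
For reference, here is a minimal sketch of what pre-sequenced filenames could look like, assuming each job knows its own index. The names `task_filename` and `task_id` are hypothetical, and how the index reaches the job (a loop variable, a scheduler's array-task variable, and so on) depends on your setup:

```r
# Hypothetical helper: derive the output name from the job's index, so two
# jobs can never pick the same name and no existence check is needed.
task_filename <- function(task_id, path = ".") {
  # four digits are enough for 2000 jobs; zero-padding keeps names in job order
  file.path(path, sprintf("differences_%04d.rds", as.integer(task_id)))
}

out_dir <- tempdir()                 # stand-in for the shared results directory
differences <- c(0.1, -0.2, 0.3)     # stand-in for one job's result
filename <- task_filename(17L, path = out_dir)
saveRDS(differences, file = filename)
```

This only works when the set of tasks is known up front, which is exactly the case the rest of this answer does not assume.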

This suggestion is geared towards shared filesystems where filenames cannot easily be determined in advance. Checking for existing files to establish uniqueness is a race condition in theory, it fails somewhat reproducibly even on a local filesystem, and on NFS and other shared filesystems the lag can cause significant problems. So we need to produce a filename "in real time" that has a very low likelihood of collision. (In my case, I write to a Lustre filesystem, where file-creation lag can be upwards of 5-10 seconds across nodes on the HPC.)

Here's a function that will produce such a filename:

#' Create a "guaranteed-unique" filename
#'
#' @param path character, the directory in which the new file will be
#'   created
#' @param fileext optional character, appended to the new filename
#' @param create logical, whether to "touch" the file
#' @return character, the filename, optionally created
#' @export
unique_filename <- local({
  .host <- gsub("[/:]", "_", Sys.info()["nodename"])
  .count <- 0L

  function(path = character(0), fileext = "", create = FALSE) {
    now <- as.numeric(Sys.time())
    # look up the PID on every call: if it were cached up with `.host`,
    # I suspect `future` would transfer the old PID to new processes
    .pid <- Sys.getpid()
    # build the name from epoch time, PID, per-process counter, and host
    filename <- sprintf(
      "%0.06f.P%i.Q%i.%s%s",
      now, .pid, .count, .host, fileext)
    .count <<- .count + 1L
    out <- if (length(path) && nzchar(path)) file.path(path, filename) else filename
    # `fs::file_touch()` requires the fs package to be installed
    if (create) fs::file_touch(out)
    out
  }
})

It is based heavily on the file naming convention used in Maildir mail storage, where NFS-based file conflicts had to be avoided as inexpensively as possible. I do not use all of the suggested components, so this implementation is weakened slightly. In my use with thousands of concurrent writes, I have seen no collisions.

It uses:

epoch time (seconds, printed to microsecond precision)
hostname
process ID (pid)
an internal counter for each process
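
The per-process counter in the list above works because `local()` gives the returned function a private environment that persists between calls, and `<<-` updates the variable in that environment. A stripped-down sketch of just that mechanism (the name `next_count` is illustrative and not part of the function above):

```r
next_count <- local({
  count <- 0L                  # lives in the closure's environment, not globally
  function() {
    current <- count
    count <<- count + 1L       # `<<-` modifies `count` in the enclosing environment
    current
  }
})

# each call within the same R process returns the next integer
ids <- c(next_count(), next_count(), next_count())
```

Because the counter is tied to one R process, two names requested within the same microsecond on the same host and PID still differ in the counter component.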

One side-effect of the filename starting with a "time" component is that the files naturally sort chronologically, if that's appealing. The use of random filenames (uuid or otherwise) does not do this as easily.
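
To illustrate the sorting property, here is a small sketch that builds names in the same format from made-up timestamps, PID, and hostname (all values are invented for the example) and checks that sorting the names recovers chronological order. It assumes byte-wise ordering, requested here with `method = "radix"`, and epoch seconds that all have the same number of integer digits:

```r
# invented timestamps, deliberately out of chronological order
times <- c(1700000002.5, 1700000000.25, 1700000001.75)

# same format string as in the function above, with a made-up PID and host
made_names <- sprintf("%0.06f.P%i.Q%i.%s.rds", times, 4242L, 0:2, "node01")

# radix sorting compares strings byte by byte, independent of the locale
sorted_names <- sort(made_names, method = "radix")

# TRUE when name order matches time order
chronological <- identical(sorted_names, made_names[order(times)])
```

Across nodes, the ordering is only as reliable as the nodes' clock synchronization.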