Safely write RDS files in parallel


I run a parallel computation in R with 2000 parallel jobs. Each job uses the following code to save its final results in RDS format:

timestamp <- format(Sys.time(), "%Y%m%d%H%M%S")
timestamp <- paste0(timestamp, sample.int(1000, 1))
filename <- paste0("differences_", timestamp, ".rds")
saveRDS(differences, file = filename)

I realized that even though I append a random number (between 1 and 1000) to the timestamp, the parallel jobs might still conflict with each other. A possible solution would be to check whether a file with the same name already exists and, if so, generate a new name:

while (file.exists(filename)) {
  timestamp <- format(Sys.time(), "%Y%m%d%H%M%S")
  timestamp <- paste0(timestamp, sample.int(1000, 1))
  filename <- paste0("differences_", timestamp, ".rds")
}

But I am not sure if this is entirely safe. Is there a way to ensure a unique file name for each result from the parallel jobs? I have to keep the files in the same path.


Solution

  • I'll offer this as a slightly-more-complex variant of @KonradRudolph's suggestion (pre-sequence filenames for each parallel task), which is far simpler and might be preferable in most situations.
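
    For reference, the simpler pre-sequenced approach mentioned above could look roughly like the sketch below. It assumes each job knows its own task index; `save_result` and `out_dir` are hypothetical names used only for illustration, and a sequential loop stands in for whatever parallel backend is actually in use.

```r
# Sketch: each task derives its file name from its own task index, so no
# two tasks can ever pick the same name. `save_result` and `out_dir` are
# hypothetical names used only for this illustration.
save_result <- function(task_id, differences, out_dir) {
  filename <- file.path(out_dir, sprintf("differences_%04d.rds", task_id))
  saveRDS(differences, file = filename)
  filename
}

out_dir <- file.path(tempdir(), "differences_demo")
dir.create(out_dir, showWarnings = FALSE)

# A sequential vapply() stands in for the parallel scheduler here; with a
# real backend the task index would come from the job array or worker loop.
written <- vapply(
  1:20,
  function(i) save_result(i, differences = rnorm(5), out_dir = out_dir),
  character(1)
)
```

    Because the name depends only on the task index, uniqueness does not rely on timing, randomness, or checking the filesystem.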

    This suggestion is geared towards shared filesystems where filenames cannot easily be pre-determined. The underlying problem is that checking for existing files to establish uniqueness is a race condition in theory, is somewhat reproducibly a problem on a local filesystem, and is considerably worse on NFS and other shared filesystems, where lag is significant. So we need to produce a filename "in real time" that has a very low likelihood of collision. (In my case, I write to a Lustre filesystem, where file-creation lag can be upwards of 5-10 seconds across nodes on the HPC.)

    Here's a function that will produce such a filename:

    #' Create a "guaranteed-unique" filename
    #'
    #' @param path character, the directory in which the new file will be
    #'   created
    #' @param fileext optional character, appended to the new filename
    #' @param create logical, whether to "touch" the file
    #' @return character, the filename, optionally created
    #' @export
    unique_filename <- local({
      .host <- gsub("[/:]", "_", Sys.info()["nodename"])
      .count <- 0L
    
      function(path = character(0), fileext = "", create = FALSE) {
        now <- as.numeric(Sys.time())
        # the PID is looked up on every call rather than cached next to
        # `.host`: I suspect `future` would otherwise transfer the old
        # PID to new processes
        .pid <- Sys.getpid()
        # combine time, PID, per-process counter, and host into one name
        filename <- sprintf(
          "%0.06f.P%i.Q%i.%s%s",
          now, .pid, .count, .host, fileext)
        .count <<- .count + 1L
        out <- if (length(path) && nzchar(path)) file.path(path, filename) else filename
        if (create) fs::file_touch(out)
        out
      }
    })
    

    It is based heavily on the file naming convention used in Maildir mail storage, where NFS-based file conflicts had to be avoided as inexpensively as possible. I do not use all of the suggested components, so this implementation is weakened slightly. In my use with thousands of concurrent writes, I have seen no collisions.

    It uses:

    • epoch time, to microsecond precision
    • hostname
    • process ID (pid)
    • an internal counter for each process

    One side-effect of the filename starting with a "time" component is that the files naturally sort chronologically, if that's appealing. The use of random filenames (uuid or otherwise) does not do this as easily.
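
    The sorting behaviour can be checked with a small sketch. It uses the same `sprintf` pattern as the function above but takes the epoch time as an argument so the result is deterministic; `name_at` and its default PID, counter, and host values are hypothetical and exist only for this illustration.

```r
# Hypothetical helper: build a name from an explicit epoch time so the
# ordering can be shown deterministically.
name_at <- function(epoch, pid = 4242L, count = 0L, host = "node01") {
  sprintf("%0.06f.P%i.Q%i.%s.rds", epoch, pid, count, host)
}

# Three creation times, deliberately out of order.
epochs <- c(1700000002.5, 1700000000.25, 1700000001.75)
files <- name_at(epochs)

# method = "radix" compares strings in the C locale (plain byte order),
# which avoids locale-dependent collation surprises.
sorted <- sort(files, method = "radix")
```

    Plain string sorting matches chronological order only while the integer part of the epoch has the same number of digits in every name, which holds for ten-digit epoch seconds.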