I run a parallel computation in R with 2000 parallel jobs. Each job uses the following code to save its final results in RDS format:
timestamp <- format(Sys.time(), "%Y%m%d%H%M%S")
timestamp <- paste0(timestamp, sample.int(1000, 1))
filename <- paste0("differences_", timestamp, ".rds")
saveRDS(differences, file = filename)
I realized that even though I append a random number (1 to 1000) to the timestamp, the parallel jobs might still conflict with each other. A possible solution is to check whether a file with the same name already exists and, if so, generate a new name:
while (file.exists(filename)) {
  timestamp <- format(Sys.time(), "%Y%m%d%H%M%S")
  timestamp <- paste0(timestamp, sample.int(1000, 1))
  filename <- paste0("differences_", timestamp, ".rds")
}
But I am not sure if this is entirely safe. Is there a way to ensure a unique file name for each result from the parallel jobs? I have to keep the files in the same path.
I'll offer this as a slightly more complex alternative to @KonradRudolph's suggestion (pre-sequenced filenames for each parallel task); that suggestion is far simpler and may be preferable in most situations.
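As a minimal sketch of the pre-sequenced approach, each task could derive its filename from its own index, so names cannot collide as long as the indices are unique. The helper name `sequenced_filename`, its `task_id` argument, and the four-digit zero-padding are assumptions for illustration, not part of the original suggestion.

```r
# Hypothetical helper: build a filename from a task index that is
# unique per job (e.g. the loop index handed to each worker).
sequenced_filename <- function(task_id, path = ".", prefix = "differences_") {
  stopifnot(length(task_id) == 1L, task_id >= 0)
  # zero-pad so the files also sort in task order
  file.path(path, sprintf("%s%04d.rds", prefix, as.integer(task_id)))
}

# 2000 jobs give 2000 distinct names, with no randomness involved
filenames <- vapply(seq_len(2000), sequenced_filename, character(1))
```

Inside a job, the result would then be written with something like `saveRDS(differences, file = sequenced_filename(i))`, where `i` is whatever index the parallel framework passes to that task.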
My variant is geared towards shared filesystems where filenames cannot easily be determined in advance. The underlying problem is that checking for existing files to establish uniqueness is a race condition in principle: it fails somewhat reproducibly even on a local filesystem, and on NFS and other shared filesystems the lag can cause significant problems. So we need to produce a filename "in real time" that has a very low likelihood of collision. (In my case, I write to a Lustre filesystem, where file-creation lag can be upwards of 5-10 seconds across nodes on the HPC.)
Here's a function that will produce such a filename:
#' Create a "guaranteed-unique" filename
#'
#' @param path character, the directory in which the new file will be
#' created
#' @param fileext optional character, appended to the new filename
#' @param create logical, whether to "touch" the file
#' @return character, the filename, optionally created
#' @export
unique_filename <- local({
  # sanitize the hostname so it is safe inside a filename
  .host <- gsub("[/:]", "_", Sys.info()["nodename"])
  # per-process call counter, incremented on every call
  .count <- 0L
  function(path = character(0), fileext = "", create = FALSE) {
    now <- as.numeric(Sys.time())
    # if we put this up with `.host`, I suspect that `future` is
    # transferring the old PID to new processes
    .pid <- Sys.getpid()
    # components: time (microseconds), PID, call counter, hostname
    filename <- sprintf(
      "%0.06f.P%i.Q%i.%s%s",
      now, .pid, .count, .host, fileext)
    .count <<- .count + 1L
    out <- if (length(path) && nzchar(path)) file.path(path, filename) else filename
    # base R `file.create()` makes an empty file; `fs::file_touch()`
    # only updates timestamps on a file that already exists
    if (create) file.create(out)
    out
  }
})
It is based heavily on the file naming convention used in Maildir mail storage, where NFS-based file conflicts had to be avoided as inexpensively as possible. I do not use all of the suggested components, so this implementation is weakened slightly. In my use with thousands of concurrent writes, I have seen no collisions.
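It uses the current time to the microsecond, the process ID (the P field), a per-process call counter (the Q field), and the host name.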
One side effect of the filename starting with a "time" component is that the files naturally sort chronologically, if that's appealing. Random filenames (UUID or otherwise) do not offer this as easily.
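To illustrate the sorting property, here is a small sketch that builds three names in the same time-first layout from fixed, made-up timestamps and sorts them as plain strings. The helper `stamp_name`, the PIDs, and the host names are hypothetical. The ordering relies on all timestamps having the same number of integer digits, which holds for current epoch times.

```r
# Hypothetical helper mirroring the time-first filename layout
stamp_name <- function(time, pid, count, host) {
  sprintf("%0.06f.P%i.Q%i.%s.rds", time, pid, count, host)
}

# made-up creation times (seconds since the epoch), deliberately out of order
times <- c(1700000002.5, 1700000000.25, 1700000001.75)
names_out <- stamp_name(times, c(311L, 42L, 9001L), 0L,
                        c("nodeB", "nodeA", "nodeC"))

# method = "radix" sorts in the C locale, independent of session collation
sorted <- sort(names_out, method = "radix")
```

Because the timestamp leads the name, `sorted` comes out in creation order regardless of which host or process wrote each file.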