I regularly collaborate on large data analysis projects using git and statistical software such as R. Because the datasets are very large and may change upon re-download, we do not keep these in the repository. While we like to design the final versions of the scripts we develop to use command line arguments to read paths to the raw datasets, it's easier to test and debug by directly reading the files into the R environment. As we develop, therefore, we end up with lines such as
something = read.raw.file("path/to/file/on/my/machine")
#something = read.raw.file("path/to/file/on/collaborators/machine")
#something = read.raw.file("path/to/file/on/other/collaborators/machine")
cluttering up the code.
There must be a better way. I've tried adding a file that each script reads before running, such as proj-config.local, containing
path.to.raw.file.1 = "/path/to/file/on/my/machine"
and adding it to .gitignore, but this is a "heavyweight" workaround given how much time it takes. It's also not obvious to collaborators that one is doing this or that they should do the same, and because the file is ignored they might name or locate it differently, so the shared line of code that reads that file ends up wrong.
Is there a better way to manage local outside-repo paths/references?
PS: I didn't notice anything addressing this issue in any of the related questions.
A solution I've been using is to build in the concept of a search path which can be used to locate files. In one particular application, I've built in the ability to override the search path with an environment variable, similar to the commonly used PATH variable.
I wrote a function, findFileInPath (below), that searches the supplied path and returns the full names of any matching files that are found. It takes a path vector and also lets you pack several directories into one element, separated by a given character, as an OS typically does.
You could use it like this (as an example only):
DataSearchPath = c(
"path/to/file/on/my/machine",
"path/to/file/on/collaborators/machine",
"path/to/file/on/other/collaborators/machine",
Sys.getenv('DATASEARCHPATH')
)
DataFilename = "data_file.csv"
DataPathname = findFileInPath(DataFilename, path=DataSearchPath)[1] # Take the first one
if (is.na(DataPathname)) {
  stop(paste("Cannot find data file", DataFilename), call.=FALSE)
}
...
I use something like that to locate files to source, configuration files, data sets, etc. I have several different search paths; some of them are exposed in the environment or in various configuration files, while others are just internal. It works pretty well.
In the example above, the DATASEARCHPATH environment variable can be set (outside of R) to a colon-separated series of paths to search.
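To illustrate how such a value is split into directories, here is a minimal sketch. It sets the variable from within R purely for demonstration (normally it would be set outside of R, e.g. in a shell profile or in ~/.Renviron), and the directory names are hypothetical.

```r
# Hypothetical directories, for illustration only. Setting the variable
# from inside R is just a convenience for this demonstration.
Sys.setenv(DATASEARCHPATH = "/data/raw:/mnt/shared/raw")

# Split the colon-separated value into individual directories.
dirs <- strsplit(Sys.getenv("DATASEARCHPATH"), ":", fixed = TRUE)[[1]]
print(dirs)

# Candidate locations for a given file name, in search order.
candidates <- file.path(dirs, "data_file.csv")
print(candidates)
```

Each collaborator can then point the variable at their own copy of the data without touching the shared script.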
My implementation of findFileInPath defaults to searching the current directory and then the directories listed in the system's PATH environment variable, split on the colon character. (This probably won't be applicable to Windows. I only use this on Mac and Linux.)
#' findFileInPath: Locates files by searching the supplied paths
#'
#' @param filename character: the name of the file to search for
#'
#' @param path character: the path to search, either a vector, or optionally
#' separated by \code{sep}.
#'
#' @param sep character: the separator character used to split \code{path}
#' into multiple components.
#'
findFileInPath = function(filename, path=c('.',Sys.getenv('PATH')), sep=':') {
  # Split each element of 'path' on the literal separator 'sep' and build
  # the candidate file names, in search order.
  candidates = file.path(unlist(strsplit(path, sep, fixed=TRUE)), filename)
  # Return only those candidates which exist.
  candidates[file.exists(candidates)]
}
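Regarding the Windows caveat: one possible adaptation, which I have not used in practice, is to default the separator to .Platform$path.sep, which base R defines as ":" on Unix-alikes and ";" on Windows. The sketch below is a variant under that assumption; the name findFileInPath2 and the temporary demo files are hypothetical.

```r
# Sketch of a variant whose default separator follows the platform.
# The name findFileInPath2 is hypothetical, not part of the original code.
findFileInPath2 <- function(filename,
                            path = c(".", Sys.getenv("PATH")),
                            sep = .Platform$path.sep) {
  # Split every element of 'path' on the literal separator.
  dirs <- unlist(strsplit(path, sep, fixed = TRUE))
  # Drop empty components, e.g. from "a::b".
  dirs <- dirs[nzchar(dirs)]
  candidates <- file.path(dirs, filename)
  # Keep only the candidates that exist, preserving search order.
  candidates[file.exists(candidates)]
}

# Small self-check using a temporary directory.
demo_dir <- tempfile("searchpath_demo")
dir.create(demo_dir)
writeLines("x,y", file.path(demo_dir, "data_file.csv"))

# A directory that does not exist is simply skipped.
missing_dir <- file.path(demo_dir, "does_not_exist")
found <- findFileInPath2("data_file.csv", path = c(missing_dir, demo_dir))
print(found)

# Taking the first element of an empty result gives NA, as in the usage above.
not_found <- findFileInPath2("no_such_file.csv", path = demo_dir)[1]
print(not_found)
```

Passing fixed = TRUE to strsplit treats the separator as a literal string, so a separator such as "|" or "." is not interpreted as a regular expression.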