Tags: r, git, version-control, path, collaboration

How can I manage paths to local files outside a git repository without cluttering the code with the differing paths on collaborators' machines?


I regularly collaborate on large data analysis projects using git and statistical software such as R. Because the datasets are very large and may change upon re-download, we do not keep these in the repository. While we like to design the final versions of the scripts we develop to use command line arguments to read paths to the raw datasets, it's easier to test and debug by directly reading the files into the R environment. As we develop, therefore, we end up with lines such as

something = read.raw.file("path/to/file/on/my/machine")
#something = read.raw.file("path/to/file/on/collaborators/machine")
#something = read.raw.file("path/to/file/on/other/collaborators/machine")

cluttering up the code.

There must be a better way. I've tried adding a file that each script reads before running, such as

proj-config.local
    path.to.raw.file.1 = "/path/to/file/on/my/machine"

and adding it to .gitignore, but this is a "heavyweight" workaround given how much time it takes. It is also not obvious to collaborators that one is doing that or that they should do the same, and because the file is ignored they might name or locate it differently, so the shared line of code that reads that file ends up wrong.

Is there a better way to manage local outside-repo paths/references?

PS: I didn't notice anything addressing this issue in any of these related questions:

  1. Workflow for statistical analysis and report writing
  2. project organization with R
  3. What best practices do you use for programming in R?
  4. How do you combine "Revision Control" with "Workflow" for R?
  5. How does software development compare with statistical programming/analysis?
  6. Essential skills of a Data Scientist
  7. Ensuring reproducibility in an R environment
  8. R and version control for the solo data analyst

Solution

  • A solution I've been using is to build in the concept of a search path which can be used to locate files. In one particular application, I've built in the ability to override the search path with an environment variable, similar to the PATH variable commonly used.

    I wrote a function, findFileInPath (below), that searches the supplied path and returns every matching file that exists. It takes a vector of paths, and each element may itself hold several directories separated by a certain character, as an OS PATH variable typically does.

    You could use it like this (as an example only):

    DataSearchPath = c(
        "path/to/file/on/my/machine",
        "path/to/file/on/collaborators/machine",
        "path/to/file/on/other/collaborators/machine",
        Sys.getenv('DATASEARCHPATH')
    )
    
    DataFilename = "data_file.csv"
    DataPathname = findFileInPath(DataFilename, path=DataSearchPath)[1] # Take the first one
    
    if (is.na(DataPathname)) {
        stop(paste("Cannot find data file", DataFilename), call.=FALSE)
    }
    
    ...
    

    I use something like that to locate files to source, configuration files, data sets, etc. I have several different search paths: some are exposed through the environment or various configuration files, while others are purely internal. It works pretty well.

    In the example above, the DATASEARCHPATH environment variable can be set (outside of R) to a colon-separated series of paths to search.
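As a rough illustration of that colon-separated format, the sketch below sets the variable from inside R for the current session and splits it into its component directories. The two directories are hypothetical placeholders, and setting the variable with Sys.setenv is only a stand-in here for setting it outside of R.

```r
# Hypothetical directories, for illustration only; substitute real locations.
Sys.setenv(DATASEARCHPATH = "/data/projA:/mnt/shared/projA")

# Split the colon-separated value into one directory per element.
# fixed = TRUE treats ":" as a literal character rather than a regex.
search_dirs <- unlist(strsplit(Sys.getenv("DATASEARCHPATH"), ":", fixed = TRUE))
print(search_dirs)
```

R also reads NAME=value lines from a user's ~/.Renviron file at startup, so a collaborator could presumably set the variable there once instead of exporting it in every shell.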

    My implementation of findFileInPath defaults to searching the system's PATH environment variable, separated by the colon character. (This probably won't work as-is on Windows, where PATH entries are separated by semicolons and drive letters contain colons. I only use this on Mac and Linux.)

    #' findFileInPath: Locates files by searching the supplied paths
    #'
    #' @param filename character: the name of the file to search for
    #'
    #' @param path character: the path to search, either a vector, or optionally
    #'   separated by \code{sep}.
    #'
    #' @param sep character: the separator character used to split \code{path}
    #'   into multiple components.
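    #'
    #' @return character: the full names of all matching files that exist, in
    #'   search order; empty if none is found.
    #'
    findFileInPath = function(filename, path=c('.',Sys.getenv('PATH')), sep=':') {

        # Split each path element on the literal separator (fixed=TRUE avoids
        # treating sep as a regular expression) and build the candidate names.
        candidates = file.path(unlist(strsplit(path, sep, fixed=TRUE)), filename)

        # Return only those candidates which exist.
        candidates[file.exists(candidates)]
    }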
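
If Windows collaborators are involved, one possible adaptation (a sketch only, not part of the function above) is to take the separator from .Platform$path.sep, which is ":" on Unix-alikes and ";" on Windows. The name findFileInPathPortable and the throwaway demo directory are assumptions made for illustration.

```r
# Sketch of a portable variant; findFileInPathPortable is a hypothetical name.
# Assumption: .Platform$path.sep is the separator used in the search path.
findFileInPathPortable <- function(filename,
                                   path = c(".", Sys.getenv("PATH")),
                                   sep = .Platform$path.sep) {
    # Split on the literal separator and drop empty components.
    dirs <- unlist(strsplit(path, sep, fixed = TRUE))
    dirs <- dirs[nzchar(dirs)]
    candidates <- file.path(dirs, filename)
    # Keep only the candidates that exist on disk.
    candidates[file.exists(candidates)]
}

# Demo with a temporary directory so the example runs on any machine.
demo_dir <- file.path(tempdir(), "findfile_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("x,y", file.path(demo_dir, "data_file.csv"))

# A search path with one nonexistent directory and one real one.
search_path <- paste("no/such/dir", demo_dir, sep = .Platform$path.sep)

found <- findFileInPathPortable("data_file.csv", path = search_path)
# Indexing an empty result with [1] gives NA, which is what the
# is.na() check in the usage example relies on.
not_found <- findFileInPathPortable("absent.csv", path = search_path)[1]
```

Each collaborator then only needs their own directory somewhere on the search path, and the shared script never has to mention a machine-specific location.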