Search code examples
rmakefile

Do .Rout files preserve the R working environment?


I recently started looking into Makefiles to keep track of the scripts inside my research project. To really understand what is going on, I would like to understand the contents of .Rout files produced by R CMD BATCH a little better.

Christopher Gandrud is using a Makefile for his book Reproducible research with R and RStudio. The sample project (https://github.com/christophergandrud/rep-res-book-v3-examples/tree/master/data) has only three .R files: two of them download and clean data, the third one merges both datasets. They are invoked by the following lines of the Makefile:

# Key variables to define
RDIR = .

# Run the RSOURCE files
$(RDIR)/%.Rout: $(RDIR)/%.R
    R CMD BATCH $<

None of the first two files outputs data; nor does the merge script explicitly import data - it just uses the objects created in the first two scripts. So how is the data preserved between the scripts?

To me it seems like the batch execution happens within the same R environment, preserving both objects and loaded packages. Is this really the case? And is it the .Rout file that transfers the objects from one script to the other or is it a property of the batch execution itself?

If the working environment is really preserved between the scripts, I see a lot of potential for issues if there are objects with the same names or functions with the same names from different packages. Another issue of this setup seems to be that the Makefile cannot propagate changes in the first two files downstream because there is no explicit input/prerequisite for the merge script.

I would appreciate to learn if my intuition is right and if there are better ways to execute R files in a Makefile.


Solution

  • By default R CMD BATCH will save your workspace to a hidden .Rdata file after running unless you choose --no-save. That's why it's not really the recommended way to run R script. The recommended way is with Rscript which will not save by default. You must write code explicitly to save if that's what you want. This is different than the Rout file which should only have the output from the commands run in the script.

    In this case, execution doesn't happen in the exact same environment. R is still called three times, but that environment is serialized and reloaded between each run.

    You are correct that there may be a lot of problems with saving and re-loading workspaces by default. That's why most people recommend you do not do that. But in this cause, the author just figured it made things easier for their workflow so they used it. It would be better to be more explicit about input and output files in general though.