Externalise config file and functions in R markdown


I am having trouble understanding the practical difference between the various ways to externalise code in R notebooks. Even after reading previous questions and the documentation, the difference between sourcing external .R files and reading them with read_chunk() is still unclear to me. For practical purposes, consider the following:

  1. I want to load libraries with an external config.R file: the most intuitive way, to me, seems to be to create config.R as

    library(first_package)
    library(second_package)
    ...
    

    and, in the general R notebook (say, main.Rmd) call it like

    ```{r}
    source('config.R')
    ```
    
    ```{r}
    # use the libraries included above
    ```
    

    However, this does not recognise the packages included, so it seems that sourcing an external config file is useless. The same happens when using read_chunk() instead. Therefore the question is: how do I include libraries at the top, so that they are recognised in the main markdown script?

  2. Say I want to define global functions externally and then use them in the main notebook: along the same lines as above, one would put them in an external foo.R file and include that file in the main one.

Again, it seems that read_chunk() does not do the job, whereas source('foo.R') does in this case. The documentation states that the former "only evaluates code, but does not execute it": when would one ever want to only evaluate the code but not execute it? Put differently: why would one ever use read_chunk() rather than source, for practical purposes?


Solution

    1. "This does not recognise the packages included"

      In your example, first_package and second_package are both available in the working environment for the second code chunk.

      Try putting library(nycflights13) in the R file and head(airlines) in the second chunk of the Rmd file. Calling knit("main.Rmd") would fail if the nycflights13 package wasn't successfully loaded with source.
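
      The same check can be made without any extra packages installed. This is a minimal sketch, assuming a throwaway config file written to a temporary path; it uses the tools package (shipped with R) as a stand-in for first_package:

      ```r
      # Write a hypothetical config.R to a temporary location.
      config <- tempfile(fileext = ".R")
      writeLines("library(tools)", config)

      # Sourcing the file runs library(tools) in the current session.
      source(config)

      # The package is attached, so its functions can be called unqualified
      # by any later code (or any later chunk in the same document).
      attached <- "package:tools" %in% search()
      extension <- file_ext("main.Rmd")
      ```

      If the notebook still reports missing functions after a source() call like this, the cause is more likely the working directory or the package installation than source() itself.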

    2. read_chunk does in fact accomplish this (as does source), but the two go about it differently. With source, the global functions are available directly after the call (as you have found). With read_chunk, however, as you pointed out, it "only evaluates code, but does not execute it", so you need to execute the chunk explicitly before the function becomes available. (See my example with third_config_chunk below: including the empty third_config_chunk chunk in the report allows the global some_function to be called in subsequent chunks.)

    Regarding "only evaluates code, but does not execute it": read_chunk reads the labelled code into knitr's internal chunk store as text, without running it, and the code runs only when a chunk with the matching label is evaluated. This is similar in spirit to lazy evaluation, although it is a knitr mechanism rather than R's own lazy evaluation of function arguments. The idea is that you may want to create a number of functions or pieces of template code which are read in but not executed on the spot, allowing you to modify the environment/parameters prior to evaluation. This also allows you to execute the same code chunks multiple times, whereas each call to source runs the whole file exactly as written.

    Consider an example where you have an external R script which contains a large amount of setup code that isn't needed in your report. It is possible to format this file into many "chunks" which will be registered with knitr by read_chunk but won't be evaluated until explicitly requested.
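
    To see that read_chunk only stores the code, you can inspect knitr's chunk store directly. This is a minimal sketch, assuming knitr is installed; the script lines are passed through the `lines` argument instead of a file, and `knitr:::knit_code` is an internal object used here purely for inspection:

    ```r
    library(knitr)

    # Lines of a hypothetical external script, passed directly instead of
    # being read from a file path.
    code <- c(
      "# ---- third_config_chunk",
      "some_function <- function(x) {",
      "  x + y",
      "}"
    )
    read_chunk(lines = code)

    # The chunk is stored as text under its label...
    stored <- knitr:::knit_code$get("third_config_chunk")
    print(stored)

    # ...but nothing has been run, so read_chunk did not define the function.
    defined <- exists("some_function")

    # Remove the stored chunks again so the session is left clean.
    knitr:::knit_code$restore()
    ```

    The function is only created once a chunk labelled third_config_chunk is actually evaluated in a document.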

    In order to externalise your config.R using read_chunk() you would write the R script as:

    config.R

    # ---- config_preamble
    ## setup code that is required for config.R
    ## to run but not for main.Rmd
    
    # ---- first_config_chunk
    library(nycflights13)
    library(MASS)
    
    # ---- second_config_chunk
    y <- 1
    
    # ---- third_config_chunk
    some_function <- function(x) {
      x + y
    }
    
    # ---- fourth_config_chunk
    some_function(10)
    
    # ---- config_output
    ## code that is output during `source`
    ## and not wanted in main.Rmd
    print(some_function(10))
    

    To use this script with the externalisation methodology, you would set up main.Rmd as follows:

    main.Rmd

    ```{r, include=FALSE}
    knitr::read_chunk('config.R')
    ```
    
    ```{r first_config_chunk}
    ```
    
    The packages are now loaded.
    
    ```{r third_config_chunk}
    ```
    
    `some_function` is now available.
    
    ```{r new_chunk}
    y <- 20
    ```
    
    ```{r fourth_config_chunk}
    ```
    ## [1] 30
    
    ```{r new_chunk_two}
    y <- 100
    lapply(seq(3), some_function)
    ```
    ## [[1]]
    ## [1] 101
    ## 
    ## [[2]]
    ## [1] 102
    ## 
    ## [[3]]
    ## [1] 103
    
    ```{r source_file_instead}
    source("config.R")
    ```
    ## [1] 11
    

    As you can see, if you were to source this file, there would be no way to modify the call to some_function prior to execution, and the call would print an output of "11". Once the chunks have been read in with read_chunk, they can be re-run any number of times (for example, after changing the value of y) or used in any other way in the current document (e.g. new_chunk_two), which would not be possible with source if you didn't want the rest of the R script to execute.
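
    The read_chunk half of this comparison can also be reproduced from a plain R session. This is a rough sketch, assuming knitr is installed; the script contents, the labels define_function and call_function, and the temporary file are made up for illustration:

    ```r
    library(knitr)

    # Hypothetical external script with two labelled chunks.
    script <- tempfile(fileext = ".R")
    writeLines(c(
      "# ---- define_function",
      "some_function <- function(x) x + y",
      "",
      "# ---- call_function",
      "some_function(10)"
    ), script)
    script_path <- normalizePath(script, winslash = "/")

    # A small R Markdown document, built as a character vector, that reads
    # the script and sets y before running the externalised call.
    rmd <- c(
      "```{r, include=FALSE}",
      sprintf("knitr::read_chunk('%s')", script_path),
      "```",
      "",
      "```{r define_function}",
      "```",
      "",
      "```{r}",
      "y <- 20",
      "```",
      "",
      "```{r call_function}",
      "```"
    )

    # Knit in a fresh environment so nothing leaks into the global workspace.
    out <- knit(text = rmd, quiet = TRUE, envir = new.env())
    cat(out)
    ```

    The knitted output shows the result of some_function(10) computed with the y set inside the document, which is the flexibility described above.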