Search code examples
rr-markdownknitrbookdown

Run R Markdown on many different datasets and save each knitted word document separately


I created an R Markdown to check for errors in a series of datasets (e.g., are there any blanks in a given column? If so, then print a statement that there are NAs and which rows have the NAs). I have setup the R Markdown to output a bookdown::word_document2. I have about 100 datasets that I need to run this same R Markdown on and get a word document output for each separately.

Is there a way to run this same R Markdown across all of the datasets and get a new word document for each (and so they are not overwritten)? All the datasets are in the same directory. I know that the output is overwritten each time you knit the document; thus, I need to be able to save each word document according to the dataset/file name.

Minimal Example

Create a Directory with 3 .xlsx Files

library(openxlsx)

setwd("~/Desktop")
dir.create("data")

dataset <-
  structure(
    list(
      name = c("Andrew", "Max", "Sylvia", NA, "1"),
      number = c(1, 2, 2, NA, NA),
      category = c("cool", "amazing",
                   "wonderful", "okay", NA)
    ),
    class = "data.frame",
    row.names = c(NA,-5L)
  )

write.xlsx(dataset, './data/test.xlsx')
write.xlsx(dataset, './data/dataset.xlsx')
write.xlsx(dataset, './data/another.xlsx')

RMarkdown

---
title: Hello_World
author: "Somebody"
output:
  bookdown::word_document2:
    fig_caption: yes
    number_sections: FALSE

---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

setwd("~/Desktop")

library(openxlsx)

# Load data for one .xlsx file. The other datasets are all in "/data". 
dataset <- openxlsx::read.xlsx("./data/test.xlsx")

```    

# Test for Errors

```{r test, echo=FALSE, comment=NA}

# Are there any NA values in the column?
suppressWarnings(if (TRUE %in% is.na(dataset$name)) {
  na.index <- which(is.na(dataset$name))
  cat(
    paste(
      "– There are NAs/blanks in the name column. There should be no blanks in this column. The following row numbers in this column need to be corrected:",
      paste(na.index, collapse = ', ')
    ),
    ".",
    sep = "",
    "\n",
    "\n"
  )
})

```

So, I would run this R Markdown with the first .xlsx dataset (test.xlsx) in the /data directory, and save the word document. Then, I would want to do this for every other dataset listed in the directory (i.e., list.files(path = "./data") and save a new word document. So, the only thing that would change in each RMarkdown would be this line: dataset <- openxlsx::read.xlsx("./data/test.xlsx"). I know that I need to set up some parameters, which I can use in rmarkdown::render, but unsure how to do it.

I have looked at some other SO entries (e.g., How to combine two RMarkdown (.Rmd) files into a single output? or Is there a way to generate a cached version of an RMarkdown document and then generate multiple outputs directly from the cache?), but most focus on combining .Rmd files, and not running different iterations of the same file. I've also looked at Passing Parameters to R Markdown.

I have also tried the following from this. Here, all the additions were added to the example R Markdown above.

Added this to the YAML header:

params:
  directory:
    value: x

Added this to the setup code chunk:

# Pull in the data
dataset <- openxlsx::read.xlsx(file.path(params$directory))

Then, finally I run the following code to render the document.

rmarkdown::render(
    input  = 'Hello_World.Rmd'
    , params = list(
        directory = "./data"
    )
)

However, I get the following error, although I only have .xlsx files in /data:

Quitting from lines 14-24 (Hello_World.Rmd) Error: openxlsx can only read .xlsx files

I also tried this on my full .Rmd file and got the following error, although the paths are exactly the same.

Quitting from lines 14-24 (Hello_World.Rmd) Error in file(con, "rb") : cannot open the connection

*Note: Lines 14–24 are essentially the setup section of the .Rmd.

I'm unsure of what I need to change. I also need to generate multiple output files, using the original filename (like "test" from test.xlsx, "another" from another.xlsx, etc.)


Solution

  • You could call render in a loop to process each file passed as a parameter :

    dir_in <- 'data'
    dir_out <- 'result'
    
    files <- file.path(getwd(),dir_in,list.files(dir_in))
    
    for (file in files) {
      print(file)
      rmarkdown::render(
        input  = 'Hello_World.Rmd',
        output_file = tools::file_path_sans_ext(basename(file)),
        output_dir = dir_out,
        params = list(file = file)
      )
    }
    

    Rmarkdown :

    ---
    title: Hello_World
    author: "Somebody"
    output:
      bookdown::word_document2:
        fig_caption: yes
        number_sections: FALSE
    params: 
      file: ""
    ---
    
    ```{r setup, include=FALSE}
    knitr::opts_chunk$set(echo = TRUE)
    
    library(openxlsx)
    
    # Load data for one .xlsx file. The other datasets are all in "/data". 
    dataset <- openxlsx::read.xlsx(file)
    
    ```    
    
    # Test for Errors
    
    ```{r test, echo=FALSE, comment=NA}
    
    # Are there any NA values in the column?
    suppressWarnings(if (TRUE %in% is.na(dataset$name)) {
      na.index <- which(is.na(dataset$name))
      cat(
        paste(
          "– There are NAs/blanks in the name column. There should be no blanks in this column. The following row numbers in this column need to be corrected:",
          paste(na.index, collapse = ', ')
        ),
        ".",
        sep = "",
        "\n",
        "\n"
      )
    })
    
    ```