Tags: r, list, tibble, map-function, mutate

A way to mutate a new column in every dataframe in a list of dataframes?


I have a list of files that I want to import as tibbles, then mutate a new column onto each tibble based on an ifelse statement for a threshold. I didn't want to use a loop, so I have been trying to make this work by using map() wrapped around read_csv %>% mutate() with no success. If possible, I would like to keep the pipes and/or do both the read csv and mutate steps within the same command.

Here are the two solutions I have tried:

This loop works, but I want to make it better.

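library(tidyverse)  # provides read_csv(), write_csv(), mutate(), if_else(), and %>%

filenames <- list.files(path = 'data-cleaned', full.names = TRUE)

# Read each file, add the label column, and overwrite the file in place
for (x in seq_along(filenames)) {
  file <- read_csv(filenames[[x]]) %>%
    mutate(output_col = if_else(target_col >= threshold, "Positive", "Negative"))
  filename <- filenames[[x]]
  write_csv(file, filename)
}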

This is my attempted improvement, but I have been getting errors every time I try to change it.

filenames <- list.files(path = 'data-cleaned', full.names = TRUE)

files <- filenames %>%
  map(read_csv %>%
  ~ .x %>% mutate(output_col = if_else(x$target_col >= threshold, "Positive", "Negative"))
   )

Error in `as_mapper()`:
! Can't convert `.f`, a two-sided formula, to a function.

In both of these cases, I am starting from the list of file names, but here is what I would expect the tibbles (simplified) to look like after being read in:

example1 <- tibble(id = 1:50, 
                  target_col = sample(1:2000, 50), 
                  other_col = sample(1:200, 50))

      id target_col other_col
   <int>      <int>     <int>
 1     1        860       118
 2     2        582       120
 3     3       1514       190
 4     4       1266        14
 5     5       1921       137
 6     6        804         1
 7     7        222       139
 8     8       1748        26
 9     9        472       107
10    10       1013       189

And this is what I would expect each tibble to look like after the mutation (except that Positive or Negative would be assigned based on whether target_col meets the numeric threshold):

example2 <- tibble(id = 1:50, 
                  target_col = sample(1:2000, 50), 
                  other_col = sample(1:200, 50),
                  output_col = sample(rep(c("Positive", "Negative"), each = 25)))

      id target_col other_col output_col
   <int>      <int>     <int> <chr>     
 1     1       1886       174 Negative  
 2     2       1794        55 Negative  
 3     3        472       178 Positive  
 4     4       1583       195 Negative  
 5     5       1037       110 Positive  
 6     6       1652       158 Negative  
 7     7        152        69 Negative  
 8     8       1396       113 Positive  
 9     9       1016       150 Positive  
10    10       1922       157 Negative  

I also want to mention I have tried looking at these existing questions and didn't find them helpful:

How to use map from purrr with dplyr::mutate to create multiple new columns based on column pairs

How to mutate multiple columns with dynamic variable using purrr:map function?

R using dplyr::mutate() within purrr::map without duplicating rows

Iterating over listed data frames within a piped purrr anonymous function call

Using purrr to iterate over two lists and then pipe into dplyr::filter across a list of data frames


Solution

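  • Your use of the formula syntax is incorrect: the formula passed to map() must start with ~ and wrap the whole expression. In your attempt, read_csv %>% sits in front of the ~, which is why as_mapper() reports a two-sided formula. In addition, x$target_col refers to an x that is never defined; inside mutate() you can refer to target_col directly.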

    The following is correct:

    files <- filenames %>%
      map(~ read_csv(.x) %>%
            mutate(output_col = if_else(target_col >= threshold, "Positive", "Negative")))
    

    purrr recommends the formula syntax only if you require backward compatibility, so I would personally use an anonymous function here:

    files <- filenames %>% map(\(x) read_csv(x) %>% ...)
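
For completeness, here is a minimal end-to-end sketch of the anonymous-function version, including writing the results back to disk as the original loop did. It is not the asker's actual setup: the temporary directory, the two sample CSV files, and the threshold value of 1000 are assumptions made so the example runs on its own. The \(x) shorthand needs R 4.1 or later.

```r
library(readr)
library(dplyr)
library(purrr)
library(tibble)

# Hypothetical setup: write two small CSVs to a temporary directory so the
# example is self-contained; in the question these live in 'data-cleaned'.
data_dir <- file.path(tempdir(), "data-cleaned")
dir.create(data_dir, showWarnings = FALSE)
write_csv(tibble(id = 1:3, target_col = c(500, 1000, 1500)),
          file.path(data_dir, "a.csv"))
write_csv(tibble(id = 1:3, target_col = c(1200, 200, 1000)),
          file.path(data_dir, "b.csv"))

threshold <- 1000  # assumed value; the question does not state it

filenames <- list.files(path = data_dir, full.names = TRUE)

# Read each file and add the label column in a single map() call.
files <- filenames %>%
  map(\(x) read_csv(x, show_col_types = FALSE) %>%
        mutate(output_col = if_else(target_col >= threshold, "Positive", "Negative")))

# Write each tibble back to the file it came from; walk2() pairs each
# tibble with its path and is called only for the side effect.
walk2(files, filenames, write_csv)
```

walk2() is used for the write step because it iterates over the tibbles and their paths in parallel and returns its input invisibly, which suits a call made only for its side effect.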