
How to get dplyr::mutate() to work with variable names when called inside a function?


I am exploring data from the Pokemon API (not actually using the API, just pulling the .csv files from GitHub). One file, pokemon_types.csv, contains the types of every Pokemon in narrow format (a Pokemon can have up to two types), with the types encoded as integers (essentially factor codes). I want to label these levels using a lookup table, types.csv, also from the API, which contains each level as an id (1, 2, 3, etc.) with a corresponding identifier (normal, fighting, flying, etc.) that I want to use as the label.

> head(read_csv(path("pokemon_types.csv")), 10)
# A tibble: 10 x 3
   pokemon_id type_id  slot
        <dbl>   <dbl> <dbl>
 1          1      12     1
 2          1       4     2
 3          2      12     1
 4          2       4     2
 5          3      12     1
 6          3       4     2
 7          4      10     1
 8          5      10     1
 9          6      10     1
10          6       3     2
> head(read_csv(path("types.csv")))
# A tibble: 6 x 4
     id identifier generation_id damage_class_id
  <dbl> <chr>              <dbl>           <dbl>
1     1 normal                 1               2
2     2 fighting               1               2
3     3 flying                 1               2
4     4 poison                 1               2
5     5 ground                 1               2
6     6 rock                   1               2

My code works when I pipe all of the steps individually, but since I am going to perform this labeling step a dozen times or more, I tried to put it into a function. The problem is that when I call the function instead (which has exactly the same steps, as far as I can tell), it throws an "object not found" error.

The Setup:

library(readr)
library(magrittr)
library(dplyr)
library(tidyr)

options(readr.num_columns = 0)

# Append web directory to filename
path <- function(x) {
  paste0("https://raw.githubusercontent.com/",
         "PokeAPI/pokeapi/master/data/v2/csv/", x)
}

The offending function:

# Use lookup table to label factor variables
label <- function(data, variable, lookup) {
  mutate(data, variable = factor(variable, 
                                 levels = read_csv(path(lookup))$id,
                                 labels = read_csv(path(lookup))$identifier))
}

This version, which doesn't use the function, works:

df.types <-
  read_csv(path("pokemon_types.csv")) %>%
  mutate(type_id = factor(type_id, 
                          levels = read_csv(path("types.csv"))$id,
                          labels = read_csv(path("types.csv"))$identifier)) %>%
  spread(slot, type_id)

head(df.types)

It returns:

# A tibble: 6 x 3
  pokemon_id `1`   `2`   
       <dbl> <fct> <fct> 
1          1 grass poison
2          2 grass poison
3          3 grass poison
4          4 fire  NA    
5          5 fire  NA    
6          6 fire  flying

This version, which uses the function, does not:

df.types <-
  read_csv(path("pokemon_types.csv")) %>%
  label(type_id, "types.csv") %>%
  spread(slot, type_id)

It fails with:

Error in factor(variable, 
                levels = read_csv(path(lookup))$id, 
                labels = read_csv(path(lookup))$identifier) : 
  object 'type_id' not found 

I know that several things here may be sub-optimal (downloading the lookup table twice on every call, for instance), but I am more interested in why wrapping seemingly identical code in a function makes it stop working. I am sure I am just making a silly mistake.


Solution

  • Thanks to the helpful comments I was able to learn all about non-standard evaluation and figure out a solution:

    label <- function(data, variable, lookup) {
      variable <- enquo(variable)
      data %>%
        mutate(!!variable := factor(!!variable, 
                                     levels = read_csv(path(lookup))$id,
                                     labels = read_csv(path(lookup))$identifier))
    }
    

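    In the original function, mutate() looked for a column literally named variable, found none, fell back to the function argument of that name, and then failed when it tried to evaluate type_id outside the data frame. The key features of the fix are enquo(), which captures (quotes) the expression passed as the argument instead of evaluating it; !!, which unquotes that captured expression so that mutate() sees the column type_id rather than the name variable; and :=, which is needed because R's syntax does not allow !! on the left-hand side of =.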
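As a possible shorthand for the enquo() / !! pair, reasonably recent versions of rlang and dplyr also provide the embrace operator {{ }}. The sketch below is a minimal illustration, not the code from the answer above: label_embrace is a hypothetical name, the two small tibbles are made-up stand-ins for pokemon_types.csv and types.csv, and lookup is assumed to be a data frame rather than a file name so that the example runs offline.

```r
library(dplyr)

# Hypothetical in-memory stand-ins for pokemon_types.csv and types.csv,
# so the sketch runs without downloading anything.
pokemon_types <- tibble(
  pokemon_id = c(1, 1, 4),
  type_id    = c(12, 4, 10),
  slot       = c(1, 2, 1)
)
types <- tibble(
  id         = c(4, 10, 12),
  identifier = c("poison", "fire", "grass")
)

# Same idea as label(), but `lookup` is a data frame here (an assumption
# made for this example) and {{ }} does the quote-and-unquote in one step.
label_embrace <- function(data, variable, lookup) {
  mutate(data,
         # "{{ variable }}" on the left of := names the result after the
         # column that was passed in, so the column is overwritten in place.
         "{{ variable }}" := factor({{ variable }},
                                    levels = lookup$id,
                                    labels = lookup$identifier))
}

labelled <- label_embrace(pokemon_types, type_id, types)
```

Passing the lookup table as a data frame also means it is read once by the caller instead of twice inside every call.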

    I tried and failed to implement a solution that avoided dplyr entirely, but at least the enquo() version works.
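For what it is worth, one way a dplyr-free version might look is sketched below. It sidesteps non-standard evaluation by taking the column name as a string and indexing with [[ ]]. The name label_base, the sample data frames, and the choice to pass lookup as a data frame are all assumptions made for this illustration.

```r
# Hypothetical in-memory stand-ins for pokemon_types.csv and types.csv.
pokemon_types <- data.frame(
  pokemon_id = c(1, 1, 4),
  type_id    = c(12, 4, 10),
  slot       = c(1, 2, 1)
)
types <- data.frame(
  id         = c(4, 10, 12),
  identifier = c("poison", "fire", "grass"),
  stringsAsFactors = FALSE
)

# Base R only: `variable` is a character string naming the column,
# so no quoting or unquoting is involved.
label_base <- function(data, variable, lookup) {
  stopifnot(is.character(variable), variable %in% names(data))
  data[[variable]] <- factor(data[[variable]],
                             levels = lookup$id,
                             labels = lookup$identifier)
  data
}

labelled <- label_base(pokemon_types, "type_id", types)
```

The trade-off is that the call site becomes label_base(df, "type_id", types) with a quoted column name, which does not read like the rest of a dplyr pipeline but works with any data frame.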