How to use across and mutate across an entire dataset that has multiple column types?


I'm trying to use dplyr's across and case_when across my entire dataset, so whenever it sees "Strongly Agree" it changes it to a numeric 5, "Agree" to a numeric 4, and so on. I've tried looking at this answer, but I'm getting an error because my dataset has logical and numeric columns and R rightfully says that "Agree" can't be in a logical column, etc.

Here's my data:

library(dplyr)
test <- tibble(name = c("Justin", "Corey", "Sibley"),
               date = c("2021-08-09", "2021-10-29", "2021-01-01"),
               s1 = c("Agree", "Neutral", "Strongly Disagree"),
               s2rl = c("Agree", "Neutral", "Strongly Disagree"),
               f1 = c("Strongly Agree", "Disagree", "Strongly Disagree"),
               f2rl = c("Strongly Agree", "Disagree", "Strongly Disagree"),
               exam = c(90, 99, 100),
               early = c(TRUE, FALSE, FALSE))

Ideally, I'd like one command that goes across the entire dataset. If that can't be done, I'd like a single across() call that combines several selections, such as multiple contains() conditions (i.e., here the column name contains "s" or "f").

Here's what I've tried already to no avail:

library(dplyr)
test %>%
  mutate(across(.), 
         ~ case_when(. == "Strongly Agree" ~ 5, 
                     . == "Agree" ~ 4,
                     . == "Neutral" ~ 3,
                     . == "Disagree" ~ 2,
                     . == "Strongly Disagree" ~ 1,
                     TRUE ~ NA))

Error: Problem with `mutate()` input `..1`.
x Must subset columns with a valid subscript vector.
x Subscript has the wrong type `tbl_df<
  name: character
  date: character
  s1  : character
  s2rl: character
  f1  : character
  f2rl: character
  exam: double
>`.
ℹ It must be numeric or character.
ℹ Input `..1` is `across(.)`.

Solution

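  • We can use matches() to select columns by regular expression; here '^(s|f)' picks every column whose name starts with "s" or "f". Two other things change compared with the attempt in the question: the lambda goes inside across() as its second argument (in the question, across(.) was closed before the lambda), and the fallback is NA_real_ rather than NA, so every branch of case_when() returns a double, which older dplyr versions require.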

    library(dplyr)
    test %>% 
        mutate(across(matches('^(s|f)'), ~ case_when(. == "Strongly Agree" ~ 5, 
                         . == "Agree" ~ 4,
                         . == "Neutral" ~ 3,
                         . == "Disagree" ~ 2,
                         . == "Strongly Disagree" ~ 1,
                         TRUE ~ NA_real_)))
    

    Output:

    # A tibble: 3 x 8
      name   date          s1  s2rl    f1  f2rl  exam early
      <chr>  <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
    1 Justin 2021-08-09     4     4     5     5    90 TRUE 
    2 Corey  2021-10-29     3     3     2     2    99 FALSE
    3 Sibley 2021-01-01     1     1     1     1   100 FALSE
    

    According to ?across:

    across() makes it easy to apply the same transformation to multiple columns, allowing you to use select() semantics inside in "data-masking" functions like summarise() and mutate().

    The help page ?select lists the selection operators and helpers, all of which can also be used inside across():

    Tidyverse selections implement a dialect of R where operators make it easy to select variables:

    : for selecting a range of consecutive variables.

    ! for taking the complement of a set of variables.

    & and | for selecting the intersection or the union of two sets of variables.

    c() for combining selections.
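As a rough illustration of these operators inside across(), the sketch below uses a trimmed copy of the question's data. The helper likert_to_num() is a hypothetical name introduced here for brevity; it just wraps the case_when() call from the answer.

```r
library(dplyr)

# Trimmed copy of the data from the question.
test <- tibble(name = c("Justin", "Corey", "Sibley"),
               s1 = c("Agree", "Neutral", "Strongly Disagree"),
               f1 = c("Strongly Agree", "Disagree", "Strongly Disagree"),
               exam = c(90, 99, 100),
               early = c(TRUE, FALSE, FALSE))

# Hypothetical helper (not part of dplyr): map Likert labels to scores.
likert_to_num <- function(x) {
  case_when(x == "Strongly Agree" ~ 5,
            x == "Agree" ~ 4,
            x == "Neutral" ~ 3,
            x == "Disagree" ~ 2,
            x == "Strongly Disagree" ~ 1,
            TRUE ~ NA_real_)
}

# `|` takes the union of two selections:
# columns whose names start with "s" or with "f".
res_union <- test %>%
  mutate(across(starts_with("s") | starts_with("f"), likert_to_num))

# `!` takes the complement: every column except the ones listed in c().
res_complement <- test %>%
  mutate(across(!c(name, exam, early), likert_to_num))
```

Both calls should touch only s1 and f1 in this data. starts_with() is used instead of contains() because contains("s") would also pick up any other column that merely has an "s" somewhere in its name.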

    In addition, you can use selection helpers. Some helpers select specific columns:

    everything(): Matches all variables.

    last_col(): Select last variable, possibly with an offset.

    These helpers select variables by matching patterns in their names:

    starts_with(): Starts with a prefix.

    ends_with(): Ends with a suffix.

    contains(): Contains a literal string.

    matches(): Matches a regular expression.

    num_range(): Matches a numerical range like x01, x02, x03.

    These helpers select variables from a character vector:

    all_of(): Matches variable names in a character vector. All names must be present, otherwise an out-of-bounds error is thrown.

    any_of(): Same as all_of(), except that no error is thrown for names that don't exist.
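When the survey columns are already known by name, all_of() and any_of() may be the most explicit option. This is a minimal sketch on the question's data; likert_cols and likert_to_num() are names made up for this example, and "g1" is a deliberately nonexistent column.

```r
library(dplyr)

test <- tibble(name = c("Justin", "Corey", "Sibley"),
               date = c("2021-08-09", "2021-10-29", "2021-01-01"),
               s1 = c("Agree", "Neutral", "Strongly Disagree"),
               s2rl = c("Agree", "Neutral", "Strongly Disagree"),
               f1 = c("Strongly Agree", "Disagree", "Strongly Disagree"),
               f2rl = c("Strongly Agree", "Disagree", "Strongly Disagree"),
               exam = c(90, 99, 100),
               early = c(TRUE, FALSE, FALSE))

# Hypothetical helper wrapping the case_when() call from the answer.
likert_to_num <- function(x) {
  case_when(x == "Strongly Agree" ~ 5,
            x == "Agree" ~ 4,
            x == "Neutral" ~ 3,
            x == "Disagree" ~ 2,
            x == "Strongly Disagree" ~ 1,
            TRUE ~ NA_real_)
}

# Column names kept in an ordinary character vector (assumed known up front).
likert_cols <- c("s1", "s2rl", "f1", "f2rl")

# all_of() is strict: it errors if any listed name is missing from the data.
res_all <- test %>%
  mutate(across(all_of(likert_cols), likert_to_num))

# any_of() is lenient: the nonexistent "g1" is silently ignored.
res_any <- test %>%
  mutate(across(any_of(c(likert_cols, "g1")), likert_to_num))
```

The strict variant is usually the safer default, since a typo in a column name then fails loudly instead of being skipped.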

    This helper selects variables with a function:

    where(): Applies a function to all variables and selects those for which the function returns TRUE.
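For the "one command across the entire dataset" part of the question, where() with a custom predicate is one possible approach. The sketch below assumes that a survey column is any character column whose values all come from the five Likert labels (or are missing); is_likert(), likert_levels and likert_to_num() are hypothetical names introduced for this example.

```r
library(dplyr)

test <- tibble(name = c("Justin", "Corey", "Sibley"),
               date = c("2021-08-09", "2021-10-29", "2021-01-01"),
               s1 = c("Agree", "Neutral", "Strongly Disagree"),
               s2rl = c("Agree", "Neutral", "Strongly Disagree"),
               f1 = c("Strongly Agree", "Disagree", "Strongly Disagree"),
               f2rl = c("Strongly Agree", "Disagree", "Strongly Disagree"),
               exam = c(90, 99, 100),
               early = c(TRUE, FALSE, FALSE))

likert_levels <- c("Strongly Disagree", "Disagree", "Neutral",
                   "Agree", "Strongly Agree")

# Hypothetical predicate: TRUE only for character columns in which every
# value is one of the Likert labels or NA. The && short-circuits, so
# numeric and logical columns are rejected without being compared.
is_likert <- function(x) {
  is.character(x) && all(x %in% c(likert_levels, NA))
}

# Hypothetical helper wrapping the case_when() call from the answer.
likert_to_num <- function(x) {
  case_when(x == "Strongly Agree" ~ 5,
            x == "Agree" ~ 4,
            x == "Neutral" ~ 3,
            x == "Disagree" ~ 2,
            x == "Strongly Disagree" ~ 1,
            TRUE ~ NA_real_)
}

# where() keeps only the columns for which the predicate returns TRUE,
# so name, date, exam and early pass through unchanged.
res_where <- test %>%
  mutate(across(where(is_likert), likert_to_num))
```

A plain where(is.character) would not be enough here, because name and date are also character columns and would be turned into NA. Selecting by content also means the call does not depend on how the survey columns are named.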