r, functional-programming, tidyverse

group_by causing issues when using custom tidyverse functions


After some blood, sweat and tears I've cobbled together some lines of code that calculate the percent rank of each row's value of x relative to y and some lagged values of y.

I now need to apply this at a group level. However, when I uncomment the group_by(grp) line, the code throws an error saying that column 'grp' is not found.

I believe the error is due to data masking. However, after quite some research on S.O. and in the tidyverse documentation, as well as wrapping variables with {{}} and !!, I'm no wiser as to what the problem is.

I'm hoping a more seasoned tidyverse user can suggest a path forward to apply my calculate_percentile_rank function at the grp level. Thanks in advance.

set.seed(123)

pacman::p_load(tidyverse)

df <- data.frame(
  grp = sample(c("A","B"), size = 100, replace = TRUE),
  x = sample(1000, size = 100),
  y = sample(10000, size = 100)
)

# Function takes point var and calculates percent rank relative to value from lookback var using window size of lags
calculate_percentile_rank <- function(df, point_var, lookback_var, lags) {
  map_lag <- lags %>% map(~ partial(lag, n = .x))
  return(df %>%  mutate(across(.cols =  {{lookback_var}}, .fns = map_lag, .names = "{.col}_lag_{lags}")) %>%
    rowwise() %>%
    mutate(lag_vector = list(c_across(c({{point_var}},contains("_lag_"))))) %>%
    rowwise() %>%
    mutate(pct_rank = list(dplyr::percent_rank(lag_vector))) %>%
    as.data.frame() %>%
    mutate(pct_rank = map_dbl(pct_rank, first)) %>%
    select(pct_rank) %>%
    as.data.frame()
  )
}

test <- df %>% 
  #group_by(grp) %>% 
  mutate(pct_rank = calculate_percentile_rank(df = ., lookback_var = y, point_var = x,lags = 1:3))

Solution

  • I actually get a different error: "pct_rank must be size 57 or 1, not 100." There are two problems:

    1. mutate() expects the function to return a vector with one value per row of the current group (or a single value). Your version of calculate_percentile_rank returns a data.frame.
    2. The df argument receives the entire data frame (all 100 rows), whereas with grouping, lookback_var and point_var are passed in as just the rows of the current group. That mismatch is where the size error comes from. You don't actually need the entire data frame to calculate your lags.
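The second point can be seen on a toy example. This is only a sketch: the data frame `toy` and the helper `n_values` are hypothetical names made up for illustration, and `nrow(toy)` stands in for what a function would see if it were handed the whole data frame.

```r
library(dplyr)

# Hypothetical toy data: group A has 3 rows, group B has 2.
toy <- data.frame(grp = c("A", "A", "A", "B", "B"), x = 1:5)

# Hypothetical helper: reports how many values it was handed.
n_values <- function(v) length(v)

res <- toy |>
  group_by(grp) |>
  mutate(
    n_seen  = n_values(x), # x is only the current group's slice
    n_total = nrow(toy)    # the full data frame always has 5 rows
  ) |>
  ungroup()

res$n_seen  # 3 3 3 2 2
res$n_total # 5 5 5 5 5
```

Inside a grouped mutate() the column x arrives one group at a time, so any result built from the full data frame has the wrong length for that group.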

    Calculate your percent rank on just the vectors:

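    # Note: df is no longer used in the body; it is kept only so the signature matches the original
    calculate_percentile_rank <- function(df, point_var, lookback_var, lags) {

      map(lags, \(.x) lag(lookback_var, n = .x)) |> # one lagged copy of lookback_var per lag
        c(list(point_var)) |> # append point_var, so it is the last element of the list
        pmap(c) |> # create one vector per "row": the lagged values followed by the point value
        map(percent_rank) |> # rank the values within each row vector
        map_dbl(last) # point_var was appended last, so its rank is the last value

    }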
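For completeness, here is one way the grouped call might look. This is a sketch under a few assumptions: the unused df argument is dropped from the signature, lag is written as dplyr::lag to avoid picking up stats::lag, and the columns are passed directly instead of through the `.` placeholder. It needs R 4.1 or later for `|>` and `\(n)`.

```r
library(dplyr)
library(purrr)

set.seed(123)

df <- data.frame(
  grp = sample(c("A", "B"), size = 100, replace = TRUE),
  x = sample(1000, size = 100),
  y = sample(10000, size = 100)
)

# Same logic as the function above, minus the unused df argument (an assumption).
calculate_percentile_rank <- function(point_var, lookback_var, lags) {
  map(lags, \(n) dplyr::lag(lookback_var, n = n)) |> # lagged copies of the lookback column
    c(list(point_var)) |>                            # point values go last
    pmap(c) |>                                       # one vector per row
    map(percent_rank) |>                             # rank within each row vector
    map_dbl(last)                                    # keep the rank of the point value
}

test <- df |>
  group_by(grp) |>
  mutate(pct_rank = calculate_percentile_rank(x, y, lags = 1:3)) |>
  ungroup()
```

Because lags are computed within each group, the first rows of every group have missing lagged values. In the very first row of a group all three lags are NA, so the point value has nothing to be ranked against and pct_rank is not a usable number there.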