
Parsing a formula with rlang


I am trying to learn how to write a domain-specific language in R with rlang. This is a minimal example to help me understand how parsing and evaluation work.

Say I have the following data:

> top <- seq(2,10,2)
> bottom <- rep(2,length(top))
> times <- rep(10,length(top))
> df <- tibble::tibble(top,bottom,times)
> df
    top bottom times
  <dbl>  <dbl> <dbl>
1  2.00   2.00  10.0
2  4.00   2.00  10.0
3  6.00   2.00  10.0
4  8.00   2.00  10.0
5  10.0   2.00  10.0

I would like a domain-specific language that accepts the following two calls:

1.

df_result1 <- divi(top | bottom ~ times, df)

2.

df_result2 <- divi(top | bottom ~ 1, df)

And produces the following:

1.

> df_result1
# A tibble: 5 x 4
    top bottom times result
  <dbl>  <dbl> <dbl>  <dbl>
1  2.00   2.00  10.0   10.0
2  4.00   2.00  10.0   20.0
3  6.00   2.00  10.0   30.0
4  8.00   2.00  10.0   40.0
5  10.0   2.00  10.0   50.0

2.

> df_result2
# A tibble: 1 x 1
  result
   <dbl>
1   3.00

In dplyr terms, the equivalent calls are:

1.

df_result1 <- df %>% mutate(result = (top/bottom)*times)

2.

df_result2 <- df %>% summarise(result = mean((top/bottom)))

Update

After some ad hoc work I came up with the following for one of the cases. It is probably technically ugly, but it gets the job done.

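library(dplyr)
library(rlang)

divi <- function(form, data){
  # (lhs of lhs) / (rhs of lhs) * rhs, each piece evaluated inside mutate()
  data %>% mutate(result = eval_tidy(f_lhs(f_lhs(form))) /
                           eval_tidy(f_rhs(f_lhs(form))) *
                           eval_tidy(f_rhs(form)))
}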

divi(top | bottom ~ times, df)

    top bottom times result
  <dbl>  <dbl> <dbl>  <dbl>
1     2      2    10     10
2     4      2    10     20
3     6      2    10     30
4     8      2    10     40
5    10      2    10     50

Solution

  • We have assumed that the general case is as follows: replace | with / and evaluate the left hand side. If the right hand side is 1, return the mean of that value; otherwise, multiply it by the evaluated right hand side and append the product to data as a new column.

    This does not use rlang but is fairly short. It splits the formula into its left hand side, right hand side and environment (lhs, rhs, e), then evaluates the left hand side with | replaced by /, giving eval_lhs. If the right hand side is 1, it returns the mean of eval_lhs; otherwise, it appends eval_lhs times the evaluated right hand side to data and returns that.

    library(tibble)
    
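    divi <- function(formula, data) {
       lhs <- formula[[2]]        # left hand side, e.g. top | bottom
       rhs <- formula[[3]]        # right hand side, e.g. times or 1
       e <- environment(formula)  # where names that are not columns are looked up
       # replace `|` with `/` in lhs, then evaluate it with the columns of data in scope
       eval_lhs <- eval(do.call("substitute", list(lhs, list("|" = `/`))), data, e)
       if (identical(rhs, 1)) tibble(result = mean(eval_lhs))
       else as_tibble(cbind(data, result = eval_lhs * eval(rhs, data, e)))
    }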
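
To see what the individual steps do, here is a small base R sketch that runs them outside the function. The data frame d and the names f, g, lhs_div, ratio and rhs_is_one are made up for illustration and are not part of the answer's code.

```r
# A formula is an unevaluated call to `~`: element 2 is the left hand side,
# element 3 the right hand side.
f <- top | bottom ~ times
lhs <- f[[2]]   # the call  top | bottom
rhs <- f[[3]]   # the symbol  times

# Swap the `|` symbol for the division function inside the unevaluated lhs.
lhs_div <- do.call("substitute", list(lhs, list("|" = `/`)))

# Evaluate the rewritten lhs with the columns of d in scope.
# d is a made-up data frame, used here for illustration only.
d <- data.frame(top = c(2, 4, 6), bottom = c(2, 2, 2), times = c(10, 10, 10))
ratio <- eval(lhs_div, d, environment(f))
ratio   # 1 2 3

# A literal 1 on the right hand side is stored as the number 1,
# which is why identical(rhs, 1) detects the summary case.
g <- top | bottom ~ 1
rhs_is_one <- identical(g[[3]], 1)
```

Because the formula is never evaluated as ordinary R code, top and bottom do not need to exist as variables when f is created; they are only looked up when eval() is called with d.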
    

    Now some test runs:

    divi(top | bottom ~ times, df)
    ## # A tibble: 5 x 4
    ##     top bottom times result
    ##   <dbl>  <dbl> <dbl>  <dbl>
    ## 1  2.00   2.00  10.0   10.0
    ## 2  4.00   2.00  10.0   20.0
    ## 3  6.00   2.00  10.0   30.0
    ## 4  8.00   2.00  10.0   40.0
    ## 5 10.0    2.00  10.0   50.0
    
    divi(top | bottom ~ 1, df)
    ## # A tibble: 1 x 1
    ##   result
    ##    <dbl>
    ## 1   3.00
    
    divi((top - bottom) | (top + bottom) ~ times^2, df)
    ## # A tibble: 5 x 4
    ##     top bottom times result
    ##   <dbl>  <dbl> <dbl>  <dbl>
    ## 1  2.00   2.00  10.0    0  
    ## 2  4.00   2.00  10.0   33.3
    ## 3  6.00   2.00  10.0   50.0
    ## 4  8.00   2.00  10.0   60.0
    ## 5 10.0    2.00  10.0   66.7
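
Since the question asked about rlang, the same logic could probably also be written with rlang's formula accessors and eval_tidy(). The following is only a sketch under that assumption: the function name divi_rlang and the helper environment div_env are hypothetical and do not come from the answer above. Instead of rewriting the expression, it evaluates the left hand side in a child environment where | is bound to the division function.

```r
library(rlang)
library(tibble)

# Hypothetical rlang-based variant of divi(); a sketch, not the answer's code.
divi_rlang <- function(formula, data) {
  lhs <- f_lhs(formula)   # e.g. top | bottom
  rhs <- f_rhs(formula)   # e.g. times, or the constant 1
  env <- f_env(formula)   # environment the formula was created in

  # Child environment in which `|` means division. Only the left hand side
  # is evaluated here, so `|` keeps its usual meaning on the right hand side.
  div_env <- new.env(parent = env)
  assign("|", `/`, envir = div_env)

  # Column names are resolved in data first, everything else in div_env.
  ratio <- eval_tidy(lhs, data = data, env = div_env)

  if (identical(rhs, 1)) {
    tibble(result = mean(ratio))
  } else {
    data$result <- ratio * eval_tidy(rhs, data = data, env = env)
    as_tibble(data)
  }
}
```

Called as divi_rlang(top | bottom ~ times, df) or divi_rlang(top | bottom ~ 1, df), this should give the same results as the base R version for the inputs shown above.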
    

    If we are willing to restrict the input so that the only forms of input allowed are:

    variable | variable ~ variable
    variable | variable ~ 1
    

    where every variable is a column in the data and no variable appears more than once in the formula, then we could simplify it like this:

    divi0 <- function(formula, data) {
      d <- get_all_vars(formula, data)
      if (ncol(d) == 2) tibble(result = mean(d[[1]] / d[[2]]))
      else as_tibble(cbind(data, result = d[[1]] / d[[2]] * d[[3]]))
    }
    
    divi0(top | bottom ~ times, df)
    divi0(top | bottom ~ 1, df)
    

    This simplification uses only the number and order of the variables in the formula and ignores the operators, so, for example, each of these gives the same answer because they all list the same variables in the same order:

    divi0(top | bottom ~ times, df)
    divi0(~ top + bottom | times, df)
    divi0(~ top * bottom * times, df)
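
The reason the operators do not matter can be checked directly. As far as I can tell, get_all_vars() collects the variable names with all.vars(), which returns the distinct names in order of first appearance. The data frame d and the names v1, v2, v3 and vars1 below are made up for this illustration.

```r
# d is a small made-up data frame, for illustration only.
d <- data.frame(top = c(2, 4), bottom = c(2, 2), times = c(10, 10))

# all.vars() returns the variable names and drops the operators.
vars1 <- all.vars(top | bottom ~ times)

# get_all_vars() therefore extracts the same columns for all three formulas.
v1 <- get_all_vars(top | bottom ~ times, d)
v2 <- get_all_vars(~ top + bottom | times, d)
v3 <- get_all_vars(~ top * bottom * times, d)
names(v1)
```

This is also why divi0 cannot distinguish a formula that genuinely means division from one that merely mentions the same columns, so it is only safe under the input restrictions stated above.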