
Understanding scope with non-standard evaluations in R data.table


How can I ensure that my non-standard evaluation [1] using data.table inherits the variables it needs from the parent frame?

Based on my understanding of dynamic scope, my code below should work, but it doesn't. What am I doing wrong?

Details

I have a list of many functions I want to apply to a single data.table that return boolean checks and messages (for when the check is TRUE). For example, let's say I am auditing a table of accounts.

library(data.table)
#----- Example data -----------------------------------------------------------
n <- 100
set.seed(123)
df <- data.table( acct_id      = paste0('ID',seq(n)),
                  acct_balance = round(pmax(rnorm(n,1000,5000),0)),
                  days_overdue = round(pmax(rnorm(n,20,20),0))
                  )
#----- Example list of rules to check (real case has more elements)------------
AuditRules <- list(
  list(
    msg_id = 1,
    msg_cat = 'Balance',
    cond_fn = function(d) d[, acct_balance > balance_limit ],
    msg_txt = 
      function(d) d[, paste('Account',acct_id,'balance is',
                            acct_balance - balance_limit, 
                            'over the limit.')]
  ),
  list(
    msg_id = 2,
    msg_cat = 'Overdue',
    cond_fn = function(d) d[, days_overdue > grace_period ],
    msg_txt = 
      function(d) d[, paste('Account',acct_id,'is overdue',
                            days_overdue-grace_period,
                            'days beyond grace period.')]
  )
)

I am looping through the list of rules and checking the dataset on each.

Desired Output

This works fine in the global environment.

balance_limit <- 1e4
grace_period  <-  14
audit <- rbindlist(
              lapply(AuditRules, function(item){
                with( item,
                      df[ cond_fn(df),
                         .(msg_id, 
                           msg_cat,
                           msg_txt = msg_txt(.SD) )
                         ]
                      )
                } )
            )
print(head(audit), row.names=FALSE)
#-----------------   Result   --------------------------------------
# msg_id msg_cat                                             msg_txt
#      1 Balance        Account ID44 balance is 1845 over the limit.
#      1 Balance        Account ID70 balance is 1250 over the limit.
#      1 Balance        Account ID97 balance is 1937 over the limit.
#      2 Overdue Account ID2 is overdue 11 days beyond grace period.
#      2 Overdue  Account ID3 is overdue 1 days beyond grace period.
#      2 Overdue  Account ID6 is overdue 5 days beyond grace period.

What doesn't work (and needs a solution)

rm(balance_limit, grace_period) # see "aside"

auditTheData <- function(d, balance_limit = 1e4, grace_period=14){
  rbindlist(
    lapply(AuditRules, function(item){
        with( item,
              d[ cond_fn(d),
                  .(msg_id, 
                    msg_cat,
                    msg_txt = msg_txt(.SD) )
                  ]
        )
    } )
  )
}
auditTheData(df)

results in the error:

Error in eval(jsub, SDenv, parent.frame()) : 
  object 'balance_limit' not found

It's not a problem with with(), although I've read (?with) that typically one should refrain from using it for programming. This also doesn't work:

auditTheData2 <- function(d, balance_limit = 1e4, grace_period=14){
  rbindlist(
    lapply(AuditRules, function(item){
          d[ item[['cond_fn']](d),
             .(msg_id, 
               msg_cat,
               msg_txt = item[['msg_txt']](.SD) )
             ]
    } )
  )
}
auditTheData2(df) # Same error

Aside: if you skip rm(balance_limit, grace_period) before the "what doesn't work" function, i.e. leave them in the global environment, you get the desired results. So it seems that the function(item) being lapply-ed can "see" into the global environment but not into the calling function (auditTheData).


[1] I'm using "non-standard" in the unscientific sense of "unusual" here. I don't know what formally counts as non-standard, but that's another (and perhaps too broad) question.


Solution

  • This seems to work:

    ar <- list(
      list(
        cat = 'Balance',
        cond_expr = quote(acct_balance > balance_limit),
        msg_expr = quote(sprintf('Account %s balance is %s over the limit.',
          acct_id, 
          acct_balance - balance_limit))
      ),
      list(
        cat = 'Overdue',
        cond_expr = quote(days_overdue > grace_period),
        msg_expr = quote(sprintf('Account %s is overdue %s days beyond grace period.', 
          acct_id, 
          days_overdue-grace_period))
      )
    )
    
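    # Wrap each quoted call in a list so rbindlist() can store it in a list column;
    # idcol auto-numbers the rules.
    audDT = rbindlist(rapply(ar, list, "call", how = "replace"), idcol = "msg_id")
    
    auditem = function(d, a, balance_limit = 1e4, grace_period = 14){
        a[, {
            # Unwrap this rule's expressions from their list columns.
            cond    = cond_expr[[1]]
            msg     = msg_expr[[1]]
            # balance_limit and grace_period are resolved via auditem()'s frame.
            .(txt = d[eval(cond), eval(msg)])
        }, by=.(msg_id, cat)]
    }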
    

    For example ...

    > head(auditem(df, audDT))
       msg_id     cat                                                 txt
    1:      1 Balance        Account ID44 balance is 1845 over the limit.
    2:      1 Balance        Account ID70 balance is 1250 over the limit.
    3:      1 Balance        Account ID97 balance is 1937 over the limit.
    4:      2 Overdue Account ID2 is overdue 11 days beyond grace period.
    5:      2 Overdue  Account ID3 is overdue 1 days beyond grace period.
    6:      2 Overdue  Account ID6 is overdue 5 days beyond grace period.
    

    I'm not sure which of these changes made the difference:

    • eval predefined expressions instead of composing them in j inside a function
    • use a table for rules, with a few benefits:
      • since every entry should have the same structure, you can verify that each is well-formed (has no missing components)
      • msg_id can be auto-numbered with rbindlist and so doesn't have to be manually typed
      • by= can be used instead of lapply, since the latter has some weird evaluation behavior
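A likely explanation for the difference is R's lexical scoping: a function looks up its free variables in the environment where it was defined (for the rule functions in the question, the global environment), not in the frame of the function that calls it. A quoted expression, by contrast, is evaluated by eval() in the frame it is called from. The sketch below illustrates this in base R only, without data.table. The names rule_fn, audit_with_fn and audit_with_expr are hypothetical and used only for illustration, and it assumes no variable called balance_limit exists in the global environment.

```r
# Hypothetical rule function defined at top level. Its enclosing environment
# is where it was defined, so `balance_limit` is looked up there, not in the
# frame of whichever function calls it.
rule_fn <- function(x) x > balance_limit

audit_with_fn <- function(x, balance_limit = 1e4) {
  # `balance_limit` is local to this function, so rule_fn() cannot see it.
  # Capture the error message instead of stopping.
  tryCatch(rule_fn(x), error = function(e) conditionMessage(e))
}

audit_with_expr <- function(x, balance_limit = 1e4) {
  # A quoted expression carries no environment of its own; eval() evaluates
  # it in the current frame, where `balance_limit` is defined.
  rule_expr <- quote(x > balance_limit)
  eval(rule_expr)
}

res_fn   <- audit_with_fn(c(5000, 20000))    # a character error message
res_expr <- audit_with_expr(c(5000, 20000))  # FALSE TRUE
```

This matches the aside in the question: with balance_limit left in the global environment, the function-based rule finds it there, which is why the first version only works at top level.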

    I also switched paste to sprintf but am sure that doesn't matter.
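As a quick check that the two produce the same text, here is a small base R sketch using two of the balance values from the output above. The vectors acct_id and over are made up for illustration.

```r
# Illustrative inputs (not the full example data).
acct_id <- c("ID44", "ID70")
over    <- c(1845, 1250)

# Both functions are vectorised and build one message per account.
msg_paste   <- paste("Account", acct_id, "balance is", over, "over the limit.")
msg_sprintf <- sprintf("Account %s balance is %s over the limit.", acct_id, over)

same <- identical(msg_paste, msg_sprintf)  # TRUE
```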

    The rapply is necessary since data.table doesn't support calls/expressions as a column type (apparently), but does support list columns.
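To see what the rapply step does on its own, here is a minimal base R sketch with a cut-down, hypothetical rules list (only cat and cond_expr). Each element of class "call" is replaced by a one-element list holding that call, and everything else is left untouched.

```r
# Cut-down, hypothetical version of the rules list from above.
rules <- list(
  list(cat = "Balance", cond_expr = quote(acct_balance > balance_limit)),
  list(cat = "Overdue", cond_expr = quote(days_overdue > grace_period))
)

# Replace every element of class "call" with list(<that call>).
wrapped <- rapply(rules, list, classes = "call", how = "replace")

before_is_call <- is.call(rules[[1]]$cond_expr)         # TRUE: a bare call
after_is_list  <- is.list(wrapped[[1]]$cond_expr)       # TRUE: now wrapped in a list
inner_is_call  <- is.call(wrapped[[1]]$cond_expr[[1]])  # TRUE: the call is inside
```

This is also why the rule expressions are pulled out with cond_expr[[1]] and msg_expr[[1]] before being passed to eval().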