How can I ensure that my nonstandard evaluation¹ using data.table is inheriting the variables it needs from the parent frame?
Based on my understanding of dynamic scope, my code below should work, but it doesn't. What am I doing wrong?
I have a list of many functions that I want to apply to a single data.table. Each one returns a boolean check and a message (for when the check is TRUE). For example, let's say I am auditing a table of accounts.
library(data.table)
#----- Example data -----------------------------------------------------------
n <- 100
set.seed(123)
df <- data.table(
  acct_id      = paste0('ID', seq(n)),
  acct_balance = round(pmax(rnorm(n, 1000, 5000), 0)),
  days_overdue = round(pmax(rnorm(n, 20, 20), 0))
)
#----- Example list of rules to check (real case has more elements)------------
AuditRules <- list(
  list(
    msg_id  = 1,
    msg_cat = 'Balance',
    cond_fn = function(d) d[, acct_balance > balance_limit ],
    msg_txt =
      function(d) d[, paste('Account', acct_id, 'balance is',
                            acct_balance - balance_limit,
                            'over the limit.')]
  ),
  list(
    msg_id  = 2,
    msg_cat = 'Overdue',
    cond_fn = function(d) d[, days_overdue > grace_period ],
    msg_txt =
      function(d) d[, paste('Account', acct_id, 'is overdue',
                            days_overdue - grace_period,
                            'days beyond grace period.')]
  )
)
I am looping through the list of rules and checking the dataset on each.
This works fine in the global environment.
balance_limit <- 1e4
grace_period <- 14
audit <- rbindlist(
  lapply(AuditRules, function(item){
    with( item,
          df[ cond_fn(df),
              .(msg_id,
                msg_cat,
                msg_txt = msg_txt(.SD) )
            ]
    )
  } )
)
print(head(audit), row.names=FALSE)
#----------------- Result --------------------------------------
# msg_id msg_cat msg_txt
# 1 Balance Account ID44 balance is 1845 over the limit.
# 1 Balance Account ID70 balance is 1250 over the limit.
# 1 Balance Account ID97 balance is 1937 over the limit.
# 2 Overdue Account ID2 is overdue 11 days beyond grace period.
# 2 Overdue Account ID3 is overdue 1 days beyond grace period.
# 2 Overdue Account ID6 is overdue 5 days beyond grace period.
What doesn't work is the same logic wrapped in a function, with the limits passed as arguments:
rm(balance_limit, grace_period) # see "aside"
auditTheData <- function(d, balance_limit = 1e4, grace_period = 14){
  rbindlist(
    lapply(AuditRules, function(item){
      with( item,
            d[ cond_fn(d),
               .(msg_id,
                 msg_cat,
                 msg_txt = msg_txt(.SD) )
             ]
      )
    } )
  )
}
auditTheData(df)
results in the error:
Error in eval(jsub, SDenv, parent.frame()) : object 'balance_limit' not found
It's not a problem with with(), although I've read (?with) that typically one should refrain from using it for programming. This also doesn't work:
auditTheData2 <- function(d, balance_limit = 1e4, grace_period = 14){
  rbindlist(
    lapply(AuditRules, function(item){
      d[ item[['cond_fn']](d),
         .(msg_id,
           msg_cat,
           msg_txt = item[['msg_txt']](.SD) )
       ]
    } )
  )
}
auditTheData2(df) # Same error
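The error is consistent with R's lexical scoping: a function looks up its free variables in the environment where it was defined, not in the frame of whoever calls it. The rule functions in AuditRules are defined at top level, so their lookup goes from their own frame to the global environment and never passes through the auditing function. A minimal base-R sketch of the same behaviour follows; check_limit, run_check and bal_limit are hypothetical names used only for illustration.

```r
# Defined at top level, so its enclosing environment is wherever this script
# runs (normally the global environment).
check_limit <- function(x) x > bal_limit

run_check <- function(x, bal_limit = 10) {
  # This bal_limit is local to run_check(); check_limit() does not search here.
  tryCatch(check_limit(x), error = function(e) conditionMessage(e))
}

res_missing <- run_check(5)  # an error message: bal_limit was not found
bal_limit <- 3               # now define it where check_limit() was defined
res_global <- run_check(5)   # TRUE: compares against 3, ignoring the argument 10
```

The second call succeeds only because bal_limit now exists next to check_limit, and the value 10 passed to run_check is never consulted.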
Aside: if you don't do rm(balance_limit, grace_period) before the "what doesn't work" function (i.e. you leave them in the global environment), you get the desired results. So it seems like the function(item) that is getting lapply-ed can "see" into the global environment but not the parent environment (auditTheData).
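One way to avoid depending on where the rule functions were defined is to hand them the thresholds explicitly. The following is a sketch under that assumption, not a drop-in replacement: the second argument p, the rules list, audit_with_params() and the three-row accts table are all hypothetical, and the data is made up so that every rule matches at least one row.

```r
library(data.table)

# Small made-up table; every rule below matches at least one row.
accts <- data.table(
  acct_id      = c("ID1", "ID2", "ID3"),
  acct_balance = c(500, 12000, 15000),
  days_overdue = c(0, 30, 10)
)

# Each rule function receives the thresholds in a list `p` instead of
# looking them up as free variables.
rules <- list(
  list(
    msg_id  = 1,
    msg_cat = "Balance",
    cond_fn = function(d, p) d[, acct_balance > p$balance_limit],
    msg_txt = function(d, p) d[, paste("Account", acct_id, "balance is",
                                       acct_balance - p$balance_limit,
                                       "over the limit.")]
  ),
  list(
    msg_id  = 2,
    msg_cat = "Overdue",
    cond_fn = function(d, p) d[, days_overdue > p$grace_period],
    msg_txt = function(d, p) d[, paste("Account", acct_id, "is overdue",
                                       days_overdue - p$grace_period,
                                       "days beyond grace period.")]
  )
)

audit_with_params <- function(d, balance_limit = 1e4, grace_period = 14) {
  p <- list(balance_limit = balance_limit, grace_period = grace_period)
  rbindlist(lapply(rules, function(item) {
    keep    <- item$cond_fn(d, p)   # logical vector, one entry per row of d
    flagged <- d[keep]              # rows that trip this rule
    data.table(msg_id  = item$msg_id,
               msg_cat = item$msg_cat,
               msg_txt = item$msg_txt(flagged, p))
  }))
}

res_default <- audit_with_params(accts)
res_strict  <- audit_with_params(accts, balance_limit = 13000)
```

Note that paste() returns a single string when given zero-length input, so a rule that flags no rows would need a guard (or sprintf()) before this sketch could be used on real data.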
¹ I'm using "non-standard" in the unscientific sense of "unusual" here. Idk what counts as non-standard, but that's another (and a too broad?) question.
This seems to work:
ar <- list(
  list(
    cat = 'Balance',
    cond_expr = quote(acct_balance > balance_limit),
    msg_expr = quote(sprintf('Account %s balance is %s over the limit.',
                             acct_id,
                             acct_balance - balance_limit))
  ),
  list(
    cat = 'Overdue',
    cond_expr = quote(days_overdue > grace_period),
    msg_expr = quote(sprintf('Account %s is overdue %s days beyond grace period.',
                             acct_id,
                             days_overdue - grace_period))
  )
)
audDT = rbindlist(rapply(ar, list, "call", how = "replace"), idcol = "msg_id")
auditem = function(d, a, balance_limit = 1e4, grace_period = 14){
  a[, {
    cond = cond_expr[[1]]
    msg = msg_expr[[1]]
    .(txt = d[eval(cond), eval(msg)])
  }, by=.(msg_id, cat)]
}
For example ...
> head(auditem(df, audDT))
msg_id cat txt
1: 1 Balance Account ID44 balance is 1845 over the limit.
2: 1 Balance Account ID70 balance is 1250 over the limit.
3: 1 Balance Account ID97 balance is 1937 over the limit.
4: 2 Overdue Account ID2 is overdue 11 days beyond grace period.
5: 2 Overdue Account ID3 is overdue 1 days beyond grace period.
6: 2 Overdue Account ID6 is overdue 5 days beyond grace period.
I'm not sure which of these changes made the difference:
- eval predefined expressions instead of composing them in j inside a function
- msg_id can be auto-numbered with rbindlist and so doesn't have to be manually typed
- by= can be used instead of lapply, since the latter has some weird evaluation behavior

I also switched paste to sprintf but am sure that doesn't matter.
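Of the changes listed, the use of eval on predefined expressions appears to be the one that affects variable lookup: a quoted expression carries no environment of its own, so its names are resolved wherever eval() runs. A minimal base-R sketch, with cond_expr_demo, run_expr and bal_limit as hypothetical names:

```r
# Just an expression; unlike a function, it has no defining environment attached.
cond_expr_demo <- quote(x > bal_limit)

run_expr <- function(x, bal_limit = 10) {
  # eval() defaults to the calling frame, so x and bal_limit are the arguments
  eval(cond_expr_demo)
}

res_low  <- run_expr(5)                  # FALSE: 5 > 10
res_high <- run_expr(50)                 # TRUE: 50 > 10
res_arg  <- run_expr(5, bal_limit = 1)   # TRUE: 5 > 1
```

This contrasts with a function stored in the rule list, which keeps looking in the environment where it was created.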
The rapply is necessary since data.table doesn't support calls/expressions as a column type (apparently), but does support list columns.
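To see what the rapply() step does in isolation, here is a small base-R sketch in which demo_rules is a hypothetical stand-in for ar. Every element of class "call" is wrapped in a one-element list and the other elements are left alone, which is also why auditem unwraps the expressions with [[1]].

```r
demo_rules <- list(
  list(cat = "Balance", cond_expr = quote(acct_balance > balance_limit))
)

# Wrap each call in a list so it can later live in a list column
wrapped <- rapply(demo_rules, list, classes = "call", how = "replace")

cat_unchanged <- identical(wrapped[[1]]$cat, "Balance")   # character leaf untouched
expr_is_list  <- is.list(wrapped[[1]]$cond_expr)          # the call now sits inside a list
expr_is_call  <- is.call(wrapped[[1]]$cond_expr[[1]])     # and is still a call once unwrapped
```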