So I am trying to convert my dplyr into DT for quicker processing time, but I am unable to convert my ifelse(any(startsWith... statement to DT. Whatever I try, it keeps doing one extreme or the other, or with the case of "Tag" it just says it doesn't exist. Maybe the problem is with rowwise but I can't figure it out. Thanks in advance!
Here's my dplyr code:
df <- df %>%
rowwise() %>%
mutate(
'Position' = coalesce(
ifelse(any(c_across(starts_with("Tag")) == "goalkeeper"), "Goalkeeper", NA),
ifelse(any(c_across(starts_with("Tag")) == "striker"), "Striker", NA),
),
Favorite = ifelse(any(c_across(starts_with("Tag")) == "favorite"), TRUE, FALSE),
across(starts_with("Tag"), ~ifelse(. %in% c("goalkeeper", "striker", "favorite"), NA_character_, .))
)
my DT attempts
df[, `Position` := coalesce(
ifelse(any(startsWith(Tag, "goalkeeper")), "Goalkeeper", NA_character_), #tried this
ifelse(grepl("striker", "^Tag"), "Striker", NA_character_), #and this
)]
df[, Favorite := any(startsWith(Tag1, "favorite"))]
df[, (grep("Tag", names(df), value = TRUE)) :=
lapply(.SD, function(x) ifelse(x %in% c("goalkeeper", "striker", "favorite"), NA_character_, x)),
.SDcols = patterns("Tag")]
Data:
Name | Tag1 | Tag2 | Tag3 |
---|---|---|---|
A | goalkeeper | NA | NA |
B | NA | striker | favorite |
Expected output:
Name | Position | Favorite |
---|---|---|
A | Goalkeeper | FALSE |
B | Striker | TRUE |
Since you're doing multi-column snapshots row-wise, I don't know that there are awesome ways to go about it, but perhaps this is sufficient?
tags <- grep("Tag", names(df), value=TRUE)
tags
# [1] "Tag1" "Tag2" "Tag3"
df[, c("Position", "Favorite") := .(
apply(.SD, 1, function(z) intersect(c("goalkeeper", "striker"), z)[1]),
apply(.SD, 1, function(z) "favorite" %in% z)), .SDcols = tags]
df
# Name Tag1 Tag2 Tag3 Position Favorite
# <char> <char> <char> <char> <char> <lgcl>
# 1: A goalkeeper <NA> <NA> goalkeeper FALSE
# 2: B <NA> striker favorite striker TRUE
(And you can easily remove the tags.)
The use of apply
is a little costly in that it causes the frame (.SD
, which in this case is just the Tag#
columns) to be converted to a matrix
internally. It's because of this conversion that the use of apply
in the context of frame rows can be expensive, rightfully so.
An alternative:
fun <- function(...) {
dots <- unlist(list(...))
list(Position = intersect(c("goalkeeper", "striker"), dots)[1], Favorite = "favorite" %in% dots)
}
df[, c("Position", "Favorite") := rbindlist(do.call(Map, c(list(f=fun), .SD))), .SDcols = tags]
The two perform at somewhat the same speed (median
, `itr/sec`
) but the first has a lower mem_alloc
, perhaps suggesting that it may be better for larger data. But don't be too hasty benchmarking on small data ...
bench::mark(
a = df[, c("Position", "Favorite") := .(
apply(.SD, 1, function(z) intersect(c("goalkeeper", "striker"), z)[1]),
apply(.SD, 1, function(z) "favorite" %in% z)), .SDcols = tags],
b = df[, c("Position", "Favorite") := rbindlist(do.call(Map, c(list(f=fun), .SD))), .SDcols = tags],
min_iterations=10000)
# # A tibble: 2 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:t> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 a 243µs 288µs 3262. 16.4KB 12.1 9963 37 3.05s <dt> <Rprofmem> <bench_tm> <tibble>
# 2 b 253µs 293µs 3109. 48.7KB 10.6 9966 34 3.21s <dt> <Rprofmem> <bench_tm> <tibble>
Expanding it to be a larger dataset,
dfbig <- rbindlist(replicate(10000, df, simplify=FALSE))
we get these benchmarking results:
bench::mark(
a = dfbig[, c("Position", "Favorite") := .(
apply(.SD, 1, function(z) intersect(c("goalkeeper", "striker"), z)[1]),
apply(.SD, 1, function(z) "favorite" %in% z)), .SDcols = tags],
b = dfbig[, c("Position", "Favorite") := rbindlist(do.call(Map, c(list(f=fun), .SD))), .SDcols = tags],
min_iterations = 500)
# # A tibble: 2 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:t> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 a 202ms 257ms 3.78 2.69MB 12.5 500 1655 2.21m <dt> <Rprofmem> <bench_tm> <tibble>
# 2 b 218ms 398ms 2.56 908.43KB 6.19 500 1210 3.26m <dt> <Rprofmem> <bench_tm> <tibble>
The mem_alloc
is lower for the second (Map
) implementation, though median
and `itr/sec`
are a little slower. I don't know which is better in your case.