I have built a function which seems to work, but I don't understand why.
My initial problem was to take a data.frame which contains counts of a population and expand it to re-create the original population. This is easy enough if you know the column names in advance.
library(tidyverse)
set.seed(121)
test_counts <- tibble(Population = letters[1:4], Length = c(1,1,2,1),
Number = sample(1:100, 4))
expand_counts_v0 <- function(Length, Population, Number) {
tibble(Population = Population,
Length = rep(Length, times = Number))
}
test_counts %>% pmap_dfr(expand_counts_v0) %>% # apply it
group_by(Population, Length) %>% # test it
summarise(Number = n()) %>%
ungroup %>%
{ all.equal(., test_counts)}
# [1] TRUE
However, I wanted to generalise it to a function which didn't need to know the column names of the data.frame, and I'm interested in NSE, so I wrote:
test_counts1 <- tibble(Population = letters[1:4],
Length = c(1,1,2,1),
Number = sample(1:100, 4),
Height = c(100, 50, 45, 90),
Width = c(700, 50, 60, 90)
)
expand_counts_v1 <- function(df, count = NULL) {
countq <- enexpr(count)
names <- df %>% select(-!!countq) %>% names
namesq <- names %>% map(as.name)
cols <- map(namesq, ~ expr(rep(!!., times = !!countq))
) %>% set_names(namesq)
make_tbl <- function(...) {
expr(tibble(!!!cols)) %>% eval(envir = df)
}
df %>% pmap_dfr(make_tbl)
}
But, when I test this function it seems to duplicate rows 4 times:
test_counts %>% expand_counts_v1(count = Number) %>%
group_by(Population, Length) %>%
summarise(Number = n()) %>%
ungroup %>%
{ sum(.$Number)/sum(test_counts$Number)}
# [1] 4
This led me to guess a solution, which was
expand_counts_v2 <- function(df, count = NULL) {
countq <- enexpr(count)
names <- df %>% select(-!!countq) %>% names
namesq <- names %>% map(as.name)
cols <- map(namesq, ~ expr(rep(!!., times = !!countq))
) %>% set_names(namesq)
make_tbl <- function(...) {
expr(tibble(!!!cols)) %>% eval(envir = df)
}
df %>% make_tbl
}
This seems to work:
test_counts %>% expand_counts_v2(count = Number) %>%
group_by(Population, Length) %>%
summarise(Number = n()) %>%
ungroup %>%
{ all.equal(., test_counts)}
# [1] TRUE
test_counts1 %>% expand_counts_v2(count = Number) %>%
group_by(Population, Length, Height, Width) %>%
summarise(Number = n()) %>%
ungroup %>%
{ all.equal(., test_counts1)}
# [1] TRUE
But I don't understand why. How is it evaluating for each row, even though I'm not using pmap anymore? The function needs to be applied to each row in order to work, so it must be somehow, but I can't see how it's doing that.
EDIT
After Artem's correct explanation of what was going on, I realised I could do this
expand_counts_v2 <- function(df, count = NULL) {
countq <- enexpr(count)
names <- df %>% select(-!!countq) %>% names
namesq <- names %>% map(as.name)
cols <- map(namesq, ~ expr(rep(!!., times = !!countq))
) %>% set_names(namesq)
expr(tibble(!!!cols)) %>% eval_tidy(data = df)
}
This gets rid of the unnecessary make_tbl function. However, as Artem said, it only works because rep is vectorised. So it works, but not by rewriting the _v0 function and pmapping it, which is the process I was trying to replicate. Eventually I discovered rlang::new_function and wrote:
expand_counts_v3 <- function(df, count = NULL) {
countq <- enexpr(count)
names <- df %>% select(-!!countq) %>% names
namesq <- names %>% map(as.name)
cols <- map(namesq, ~ expr(rep(!!., times = !!countq))
) %>% set_names(namesq)
all_names <- df %>% names %>% map(as.name)
args <- rep(0, times = length(all_names)) %>% as.list %>% set_names(all_names)
correct_function <- new_function(args, # this makes the function as in _v0
expr(tibble(!!!cols)) )
pmap_dfr(df, correct_function) # applies it as in _v0
}
which is longer, and probably uglier, but works the way I originally wanted.
The issue is in eval( envir = df ), which exposes the entire data frame to make_tbl(). Notice that you never use the ... argument inside make_tbl(). Instead, the function effectively computes the equivalent of
with( df, tibble(Population = rep(Population, times = Number),
Length = rep(Length, times=Number)) )
regardless of what arguments you provide to it. When you call the function via pmap_dfr(), it essentially computes the above four times (once for each row) and concatenates the results by-row, resulting in the duplication of entries you've observed. When you remove pmap_dfr(), the function is called once, but since rep is itself vectorized (try doing rep( test_counts$Population, test_counts$Number ) to see what I mean), make_tbl() computes the entire result in one go.
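To make both effects concrete, here is a minimal base-R sketch. The data frame df and its values are made up for illustration (it is not test_counts), and lapply() stands in for pmap_dfr() on the assumption that all that matters here is that the same expression is evaluated once per row.

```r
# Toy data standing in for test_counts (values are illustrative only)
df <- data.frame(Population = c("a", "b", "c"),
                 Number = c(2L, 1L, 3L),
                 stringsAsFactors = FALSE)

# rep() is vectorised over `times`: element i is repeated times[i] times,
# so a single call expands the whole column
once <- with(df, rep(Population, times = Number))
once
# [1] "a" "a" "b" "c" "c" "c"

# Evaluating that same whole-data-frame expression once per row and
# concatenating the pieces repeats the full result nrow(df) times
per_row <- unlist(lapply(seq_len(nrow(df)), function(i) {
  with(df, rep(Population, times = Number))  # note: `i` is never used
}))
length(per_row) / length(once)
# [1] 3
```

The unused i plays the same role as the unused ... in make_tbl(): the per-row arguments are ignored, so every call returns the full expansion.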