Data.frame allows operations on column subsets using [
, dropping single column/row outputs to vectors by default. Dplyr does not allow this, deliberately (and seemingly because coding was an absolute nightmare).
df <- data.frame(a = c(1:5,NA), b = c(1,1,1,2,2,2))
mean(df[,"a"], na.rm = T) # 3
dftbl <- as.tbl(df)
mean(dftbl[,"a"], na.rm = T) # NA
Advice is therefore to subset with [[
as this will deliver uniform outputs for both dfs and tbl_dfs.
But: that's fine for columns or rows only, but not for rows+columns, and concerningly this difference can be missed if you don't check the warnings (which is my own fault admittedly), e.g.:
dfresult <- mean(df[df$b == 2, "a"], na.rm = T) # 4.5
tblresult <- mean(dftbl[dftbl$b == 2, "a"], na.rm = T) # NA_real_
Does anyone have any 'best practice' suggestions for performing column operations on row subsets? Is this where I should improve my dplyr
game using filter
& select
? My attempts thus far keep hitting walls. Grateful for any golden rules. Thanks in advance.
dftbl %>% filter(b == 2) %>% select(a) %>% mean(na.rm = T) #NA
This fails in the same way, with the filtered & selected data STILL being an N*1 tibble which refuses to play with mean
.
dftbl %>% filter(b == 2) %>% select(a) %>% as.data.frame() %>% .$a
# [1] 4 5 NA
But
dftbl %>% filter(b == 2) %>% select(a) %>% as.data.frame() %>% mean(.$a, na.rm = T)
# [1] NA
The reason is because we need [[
instead of [
as with [
it is still a tibble
with one column. The mean
expects the input as a vector
mean(dftbl[["a"]], na.rm = TRUE)
#[1] 3
Or use $
mean(dftb$a, na.rm = TRUE)
Regarding the second case, select
also returns a tibble with the selected columns. Instead, we can use pull
to extract as a vector
dftbl[dftbl$b == 2, "a"] %>%
pull(1)
#[1] 4 5 NA
Or if we don't want to load any libraries, use unlist
mean(unlist(dftbl[dftbl$b == 2, "a"]), na.rm = TRUE)
#[1] 4.5
For the code mentioned in the OP's post
dftbl %>%
filter(b == 2) %>%
select(a) %>%
.$a %>%
mean(., na.rm = TRUE)
#[1] 4.5
Or with pull
dftbl %>%
filter(b == 2) %>%
pull(a) %>%
mean(na.rm = TRUE)
#[1] 4.5