In R, I want to summarize my data after grouping it by the runs of a variable x (that is, each group corresponds to a subset of the data in which consecutive x values are the same). For instance, consider the following data frame, where I want to compute the average y value within each run of x:
(dat <- data.frame(x=c(1, 1, 1, 2, 2, 1, 2), y=1:7))
# x y
# 1 1 1
# 2 1 2
# 3 1 3
# 4 2 4
# 5 2 5
# 6 1 6
# 7 2 7
In this example, the x variable has runs of length 3, then 2, then 1, and finally 1, taking values 1, 2, 1, and 2 in those four runs. The corresponding means of y in those groups are 2, 4.5, 6, and 7.
It is easy to carry out this grouped operation in base R using tapply, passing dat$y as the data, using rle to compute the run number from dat$x, and passing the desired summary function:
tapply(dat$y, with(rle(dat$x), rep(seq_along(lengths), lengths)), mean)
# 1 2 3 4
# 2.0 4.5 6.0 7.0
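To make the rle step easier to follow, here is the same computation split into pieces. This is only a sketch: run_id is a hypothetical helper name of my own, not a function from base R or dplyr.

```r
# Hypothetical helper (not part of base R or dplyr): map each element of a
# vector to the index of the run it belongs to.
run_id <- function(v) {
  r <- rle(v)                            # run lengths and run values
  rep(seq_along(r$lengths), r$lengths)   # repeat run index i lengths[i] times
}

dat <- data.frame(x = c(1, 1, 1, 2, 2, 1, 2), y = 1:7)

ids <- run_id(dat$x)
ids
# [1] 1 1 1 2 2 3 4

# Same grouped mean as the one-liner above, using the precomputed ids
run_means <- tapply(dat$y, ids, mean)
```

The ids vector is what gets passed to tapply as the grouping index, so the first three rows fall in run 1, the next two in run 2, and the last two rows each form their own run.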
I figured I would be able to pretty directly carry over this logic to dplyr, but my attempts so far have all ended in errors:
library(dplyr)
# First attempt
dat %>%
group_by(with(rle(x), rep(seq_along(lengths), lengths))) %>%
summarize(mean(y))
# Error: cannot coerce type 'closure' to vector of type 'integer'
# Attempt 2 -- maybe "with" is the problem?
dat %>%
group_by(rep(seq_along(rle(x)$lengths), rle(x)$lengths)) %>%
summarize(mean(y))
# Error: invalid subscript type 'closure'
For completeness, I could reimplement the rle run id myself using cumsum, head, and tail to get around this, but it makes the grouping code tougher to read and involves a bit of reinventing the wheel:
dat %>%
group_by(run=cumsum(c(1, head(x, -1) != tail(x, -1)))) %>%
summarize(mean(y))
# run mean(y)
# (dbl) (dbl)
# 1 1 2.0
# 2 2 4.5
# 3 3 6.0
# 4 4 7.0
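As a sanity check, the cumsum-based run id can be compared against the rle-based one in base R, outside of any dplyr pipeline. This is a minimal sketch; the variable names changed, run_cumsum, and run_rle are mine.

```r
x <- c(1, 1, 1, 2, 2, 1, 2)

# TRUE wherever an element differs from the one before it
changed <- head(x, -1) != tail(x, -1)
changed
# [1] FALSE FALSE  TRUE FALSE  TRUE  TRUE

# Start at 1 and add 1 at every change point
run_cumsum <- cumsum(c(1, changed))

# rle-based run id for comparison
r <- rle(x)
run_rle <- rep(seq_along(r$lengths), r$lengths)

all(run_cumsum == run_rle)
# [1] TRUE
```

One small difference: because the vector starts with the numeric 1, run_cumsum is a double, which is why the run column above prints as (dbl), whereas the rle-based id is an integer.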
What is causing my rle-based grouping code to fail in dplyr, and is there any solution that enables me to keep using rle when grouping by run id?
Update: As of 2023, this appears to have been fixed in dplyr, so my original code works fine and there's no need for any workarounds.
One option seems to be the use of {} as in:
dat %>%
group_by(yy = {yy = rle(x); rep(seq_along(yy$lengths), yy$lengths)}) %>%
summarize(mean(y))
#Source: local data frame [4 x 2]
#
# yy mean(y)
# (int) (dbl)
#1 1 2.0
#2 2 4.5
#3 3 6.0
#4 4 7.0
It would be nice if future dplyr versions also had an equivalent of data.table's rleid function.
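In the meantime, rleid itself can be called from inside a dplyr pipeline, since it is an ordinary function that takes a vector and returns the run ids. A sketch, assuming both dplyr and data.table are installed; the column names run and mean_y are my own choices.

```r
library(dplyr)

dat <- data.frame(x = c(1, 1, 1, 2, 2, 1, 2), y = 1:7)

# data.table::rleid(x) returns the integer run id of each element of x
res <- dat %>%
  group_by(run = data.table::rleid(x)) %>%
  summarize(mean_y = mean(y))

res$run
# [1] 1 2 3 4
res$mean_y
# [1] 2.0 4.5 6.0 7.0
```

If I recall correctly, newer dplyr releases (1.1.0 onward) also ship consecutive_id(), which plays the same role without needing data.table.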
I noticed that this problem occurs with a data.frame or tbl_df input, but not with a tbl_dt or data.table input:
dat %>%
tbl_df %>%
group_by(yy = with(rle(x), rep(seq_along(lengths), lengths))) %>%
summarize(mean(y))
# Error: cannot coerce type 'closure' to vector of type 'integer'
dat %>%
tbl_dt %>%
group_by(yy = with(rle(x), rep(seq_along(lengths), lengths))) %>%
summarize(mean(y))
#Source: local data table [4 x 2]
#
# yy mean(y)
# (int) (dbl)
#1 1 2.0
#2 2 4.5
#3 3 6.0
#4 4 7.0
I reported this as an issue on dplyr's GitHub page.