My need is simple: I have a data.frame with a grouping variable, like this:
library(dplyr)
proportion = 0.5; set.seed(1)
df = data.frame(id=1:6, name=c("a", "a", "b"), value=rnorm(6)) %>% arrange(name)
I want to keep only the first half of each group (when ordered by id). I'd like to work with a modifiable proportion instead of exactly half, such as 0.65, because this is for splitting data into train and test sets.
Many questions answer this, but with a fixed number of rows (using top_n(), for example). I don't know how to make the number depend on the size of each group using dplyr. And I don't want sample_frac(), because it would break the id order.
However, I have come up with a two-step solution using a custom function:
myfunc = function(data, prop){head(data, nrow(data)*prop)}
splitted.data = split(df, df$name)
lapply(splitted.data, myfunc, prop=proportion) %>% bind_rows()
####   id name      value
#### 1  1    a -0.6264538
#### 2  2    a  0.1836433
#### 3  3    b -0.8356286
But can I do this with dplyr directly? Thanks.
You can use n(), which gives the number of rows in the current group. It doesn't work inside top_n(), but it works inside filter() and slice():
df %>%
  group_by(name) %>%
  filter(row_number() <= proportion * n())
or
df %>%
  group_by(name) %>%
  slice(seq(proportion * n()))
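One edge case may be worth guarding against: seq(x) counts down from 1 when x is below 1, so with slice(seq(...)) a group holding a single row would still contribute that row at proportion 0.5, whereas the filter() version would drop it. The sketch below shows two possible ways around this. It assumes dplyr >= 1.0.0 (where slice_head() takes a prop argument) and sorts by id within each group up front; the names train, train_idx and test are only illustrative.

```r
library(dplyr)

proportion <- 0.5
set.seed(1)
df <- data.frame(id = 1:6, name = c("a", "a", "b"), value = rnorm(6)) %>%
  arrange(name, id)  # make the "first rows by id" ordering explicit

# Assumption: dplyr >= 1.0.0. On a grouped data frame, slice_head()
# applies `prop` per group and rounds the row count down.
train <- df %>%
  group_by(name) %>%
  slice_head(prop = proportion) %>%
  ungroup()

# Row-index variant: seq_len(floor(...)) is empty when
# proportion * n() < 1, so tiny groups contribute no rows.
train_idx <- df %>%
  group_by(name) %>%
  slice(seq_len(floor(proportion * n()))) %>%
  ungroup()

# Everything not selected above can serve as the test set.
test <- anti_join(df, train, by = "id")
```

With the example data both variants keep ids 1, 2 and 3, matching the split()/lapply() result from the question, and test holds the remaining ids.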