When I filter a dataset based on a lag() function, I lose the first row in each group (because those rows have no lag value). How can I avoid this so that I keep the first rows despite their not having any lag value?
ds <-
structure(list(mpg = c(21, 21, 21.4, 18.7, 14.3, 16.4), cyl = c(6,
6, 6, 8, 8, 8), hp = c(110, 110, 110, 175, 245, 180)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -6L), .Names = c("mpg",
"cyl", "hp"))
library(dplyr)

# Example: filtering on lag() drops the first row of each group
ds %>%
  group_by(cyl) %>%
  arrange(-mpg) %>%
  filter(hp <= lag(hp))
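Having filter(hp <= lag(hp)) excludes rows where lag(hp) is NA, because the comparison then evaluates to NA and filter() keeps only rows where the condition is TRUE. You can instead filter for either that inequality or for lag(hp) being NA, as is the case for the top row of each group.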
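To see why those rows disappear, here is a minimal sketch on a made-up vector (the names x, toy, and kept are hypothetical and not part of the question's data). It assumes dplyr is attached, so lag() refers to dplyr::lag() rather than stats::lag().

```r
library(dplyr)

# Hypothetical toy vector, used only to illustrate the NA behaviour
x <- c(245, 180, 175)

# lag() shifts the values down by one, so the first element has no predecessor
lag(x)       # NA 245 180

# Comparing against that missing value gives NA rather than TRUE or FALSE
x <= lag(x)  # NA TRUE TRUE

# filter() keeps only rows where the condition is TRUE, so the NA row is dropped
toy <- tibble(x = x)
kept <- filter(toy, x <= lag(x))
nrow(kept)   # 2
```

The first row is dropped not because it fails the inequality, but because the inequality cannot be evaluated for it.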
I included prev = lag(hp) to make a standalone variable for the lags, just for clarity and debugging.
library(tidyverse)
ds %>%
  group_by(cyl) %>%
  arrange(-mpg) %>%
  mutate(prev = lag(hp)) %>%
  filter(hp <= prev | is.na(prev))
This yields:
# A tibble: 4 x 4
# Groups:   cyl [2]
    mpg   cyl    hp  prev
  <dbl> <dbl> <dbl> <dbl>
1  21.4    6.  110.   NA
2  21.0    6.  110.  110.
3  21.0    6.  110.  110.
4  18.7    8.  175.   NA
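A possible alternative, not the approach used above, is to give lag() a default value so the first row of each group never produces NA in the first place. The sketch below assumes hp is a double column, so Inf is a valid stand-in that every hp value compares as less than or equal to. It rebuilds ds with tibble() only to be self-contained, and res is a hypothetical name.

```r
library(dplyr)

# Same data as in the question, rebuilt so this snippet runs on its own
ds <- tibble(
  mpg = c(21, 21, 21.4, 18.7, 14.3, 16.4),
  cyl = c(6, 6, 6, 8, 8, 8),
  hp  = c(110, 110, 110, 175, 245, 180)
)

res <- ds %>%
  group_by(cyl) %>%
  arrange(-mpg) %>%
  # Assumption: hp is double, so Inf can pad the missing first lag in each group
  filter(hp <= lag(hp, default = Inf))

res
```

This should keep the same four rows as the is.na() version, without the extra prev column. For an integer column, Inf may not be accepted as a default, and the explicit is.na() check is the safer choice.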