
dplyr::filter() based on dplyr::lag() without losing first values


When I filter a dataset based on a lag() function, I lose the first row in each group (because those rows have no lag value). How can I avoid this so that I keep the first rows despite their not having any lag value?

ds <- 
  structure(list(mpg = c(21, 21, 21.4, 18.7, 14.3, 16.4), cyl = c(6, 
  6, 6, 8, 8, 8), hp = c(110, 110, 110, 175, 245, 180)), class = c("tbl_df", 
  "tbl", "data.frame"), row.names = c(NA, -6L), .Names = c("mpg", 
  "cyl", "hp"))

library(dplyr)

# example of a filter based on lag() that drops the first row of each group
ds %>% 
  group_by(cyl) %>% 
  arrange(-mpg) %>% 
  filter(hp <= lag(hp))

Solution

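  • filter(hp <= lag(hp)) drops rows where lag(hp) is NA, because the comparison then evaluates to NA and filter() keeps only rows where the condition is TRUE. You can instead filter for rows where either that inequality holds or lag(hp) is NA, as it is for the first row of each group.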

    I added prev = lag(hp) as a standalone column for the lagged values, purely for clarity and debugging.

    library(tidyverse)
    
    ds %>%
        group_by(cyl) %>%
        arrange(-mpg) %>%
        mutate(prev = lag(hp)) %>%
        filter(hp <= prev | is.na(prev))
    

    This yields:

    # A tibble: 4 x 4
    # Groups:   cyl [2]
        mpg   cyl    hp  prev
      <dbl> <dbl> <dbl> <dbl>
    1  21.4    6.  110.   NA 
    2  21.0    6.  110.  110.
    3  21.0    6.  110.  110.
    4  18.7    8.  175.   NA
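
A possible alternative, sketched here and not part of the answer above: give lag() a default so the first row of each group is compared against its own value and therefore passes the filter. The names na_comparison and kept are only illustrative, and ds is rebuilt with tibble() so the snippet runs on its own. This assumes hp itself has no missing values; a genuine NA in hp would still be dropped.

```r
library(dplyr)

# Same data as in the question, rebuilt so the example is self-contained
ds <- tibble(
  mpg = c(21, 21, 21.4, 18.7, 14.3, 16.4),
  cyl = c(6, 6, 6, 8, 8, 8),
  hp  = c(110, 110, 110, 175, 245, 180)
)

# Why the first rows vanish: comparing against NA gives NA,
# and filter() keeps only rows where the condition is TRUE
na_comparison <- 110 <= NA

# Assumed alternative: use the group's first hp as the default "previous"
# value, so hp <= lag(hp) is TRUE for the first row of each group
kept <- ds %>%
  group_by(cyl) %>%
  arrange(-mpg) %>%
  filter(hp <= lag(hp, default = first(hp))) %>%
  ungroup()

print(kept)
```

This keeps the same four rows as the is.na(prev) version without adding a helper column; which form reads better is largely a matter of taste.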