Search code examples
rfilterdplyrtidyverselag

dplyr::filter rows with characters using lag/lead matching several columns


I am attempting to filter rows from a dataframe (df) using the contents in specific columns (col1, col2, and col3) in the next row.

This question got close but only uses one column to lag

Most of the posts showing how to filter using lag/lead have columns which are numbers, in my case they are all text.

df <- tibble::tribble(
  ~col1,  ~col2,     ~col3,          ~Effect,
  "Jim", "Walk", "optionA",      "col1×col2",
  "Jim", "Walk", "optionA", "col1×col2×col2",
  "Jim",  "Run", "optionB",           "col1",
  "Jim",  "Run", "optionB",      "col1×col2",
  "Jim",  "Run", "optionB", "col1×col2×col2",
  "Joe", "Walk", "optionA",           "col1",
  "Joe", "Walk", "optionA",      "col1×col2",
  "Joe",  "Run", "optionB", "col1×col2×col2"
  )

I would like to filter rows if the next row (except the Effect column) is identical.

The final dataframe would look like this

df_result <- tibble::tribble(
  ~col1,  ~col2,     ~col3,          ~Effect,
  "Jim", "Walk", "optionA", "col1×col2×col2",
  "Jim",  "Run", "optionB", "col1×col2×col2",
  "Joe", "Walk", "optionA",      "col1×col2",
  "Joe",  "Run", "optionB", "col1×col2×col2"
  )

Does anyone have any suggestions? If possible I would like to use the tidyverse to get the solution.


Solution

  • We can use distinct

    library(dplyr)
    df %>%
       slice(rev(row_number())) %>%
      distinct(across(col1:col3), .keep_all = TRUE)
    

    -ouptut

    # A tibble: 4 x 4
      col1  col2  col3    Effect        
      <chr> <chr> <chr>   <chr>         
    1 Joe   Run   optionB col1×col2×col2
    2 Joe   Walk  optionA col1×col2     
    3 Jim   Run   optionB col1×col2×col2
    4 Jim   Walk  optionA col1×col2×col2
    

    Or using nchar

    df %>%
        group_by(across(col1:col3)) %>%
        slice(which.max(nchar(Effect))) %>% 
        ungroup