Search code examples
rif-statementselectdplyraggregate

select a grouped row based on another variable


I have a data set where for each region there is a value pos, but some regions don't have a value of 1.

region <- c('A','A','B','B','C','D','D','E')
pos <- c(0,1,0,1,0,0,1,0)
df <- data.frame(region, pos)

For each region, I would like to select the row where the value of pos is 1. If there is a region where there is no pos value of 1, I would like to select the 0 row. So in essence I would like to have a resulting data set that has one row for each region but for the regions where this is a 1, I would like to select that row. There is never more than 2 rows for each region.

The result should look like this:

    df <
      Region pos
    1   A    1
    2   B    1
    3   C    0
    4   D    1
    5   E    0

Thanks in advance for your help.


Solution

  • If you're certain all values are 0s and 1s, you might be able to get away with:

    library(dplyr)
    df %>%
      slice_max(pos, by = region, with_ties = FALSE)
    #   region pos
    # 1      A   1
    # 2      B   1
    # 3      C   0
    # 4      D   1
    # 5      E   0
    

    (You can verify this returns at most one row per region by using bind_rows(df, df) instead of df, just to demonstrate.)

    If that's not safe, though, then something like

    df %>%
      group_by(region) %>%
      filter(row_number() == c(which(pos == 1), which(pos == 0))[1]) %>%
      ungroup()
    # # A tibble: 5 × 2
    #   region   pos
    #   <chr>  <dbl>
    # 1 A          1
    # 2 B          1
    # 3 C          0
    # 4 D          1
    # 5 E          0
    

    In this case, if a region has neither pos == 1 nor pos == 0 then the region will go away.