r, dataframe, matching, row-removal

Remove any data frame rows containing a value in one column which has multiple matches in another column


Let's say I have this (simplified) data frame:

C1 <- c('a','a','b','b','c','c')
C2 <- c(10,10,20,21,30,30)
C3 <- c(1.1,2.2,3.3,4.4,5.5,6.6)
df <- data.frame(C1,C2,C3)

C1 C2 C3
a 10 1.1
a 10 2.2
b 20 3.3
b 21 4.4
c 30 5.5
c 30 6.6

What I'm trying to do is delete every row whose C1 value is paired with more than one distinct value in the C2 column. In this case I would like to delete all rows containing 'b' in the C1 column, because 'b' is paired with two different C2 values (20 and 21).

This should result in this df:

C1 C2 C3
a 10 1.1
a 10 2.2
c 30 5.5
c 30 6.6

Any help would be really appreciated!

Thanks,

Yuval


Solution

  • dplyr is one way to do this. Use group_by to process each C1 group separately, then filter, keeping only the groups that have a single distinct value of C2.

    library(dplyr)
    
    C1 <- c('a','a','b','b','c','c')
    C2 <- c(10,10,20,21,30,30)
    C3 <- c(1.1,2.2,3.3,4.4,5.5,6.6)
    df <- data.frame(C1,C2,C3)
    
    df <- df %>%
        group_by(C1) %>%
        filter(length(unique(C2)) == 1) %>%
        ungroup()
    
    print(df)
    

    Output

    # A tibble: 4 x 3
      C1       C2    C3
      <chr> <dbl> <dbl>
    1 a        10   1.1
    2 a        10   2.2
    3 c        30   5.5
    4 c        30   6.6
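
If you would rather not load dplyr, the same grouping idea can be sketched in base R. This is only a minimal sketch on the example data from the question; the names n_c2 and result are made up for illustration. ave() applies a function within each C1 group and returns one value per row, so it can be used directly as a row filter.

```r
C1 <- c('a','a','b','b','c','c')
C2 <- c(10,10,20,21,30,30)
C3 <- c(1.1,2.2,3.3,4.4,5.5,6.6)
df <- data.frame(C1,C2,C3)

# For every row, count the distinct C2 values within that row's C1 group
# (n_c2 is a hypothetical helper name, not part of the original answer)
n_c2 <- ave(df$C2, df$C1, FUN = function(x) length(unique(x)))

# Keep only the rows whose C1 group has exactly one distinct C2 value
result <- df[n_c2 == 1, ]

print(result)
```

Unlike the dplyr version, this returns a plain data.frame and keeps the original row names (1, 2, 5, 6) rather than renumbering the rows.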