
R - Identify and remove duplicate rows based on two columns


I have some data that looks like this:

Course_ID   Text_ID
33          17
33          17
58          17
5           22
8           22
42          25
42          25
17          26
17          26
35          39
51          39

Not having a background in programming, I'm finding it tricky to articulate my question, but here goes: I only want to keep rows where Course_ID varies but where Text_ID is the same. So for example, the final data would look something like this:

Course_ID   Text_ID
5           22
8           22
35          39
51          39

As you can see, Text_ID 22 and 39 are the only ones that have different Course_ID values. I suspect subsetting the data would be the way to go, but as I said, I'm quite a novice at this kind of thing and would really appreciate any advice on how to approach this.


Solution

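  • Keep only those Text_ID groups in which no Course_ID value is repeated, i.e. where the number of distinct Course_ID values equals the number of rows in the group.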

    In dplyr you can write this as -

    library(dplyr)
    # keep a Text_ID group only if all of its Course_ID values are distinct
    df %>% group_by(Text_ID) %>% filter(n_distinct(Course_ID) == n()) %>% ungroup()
    
    #  Course_ID Text_ID
    #      <int>   <int>
    #1         5      22
    #2         8      22
    #3        35      39
    #4        51      39
    

    and in data.table -

    library(data.table)
    # setDT() converts df to a data.table by reference
    setDT(df)[, .SD[uniqueN(Course_ID) == .N], by = Text_ID]
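
If you would rather not load a package, the same idea can be sketched in base R. This is only a minimal sketch: it assumes the data frame is called df (rebuilt here from the sample in the question), and the names keep and result are just illustrative. It uses ave() to flag, for every row, whether its Text_ID group contains no repeated Course_ID.

```r
# Rebuild the sample data from the question (assumed to be called df).
df <- data.frame(
  Course_ID = c(33, 33, 58, 5, 8, 42, 42, 17, 17, 35, 51),
  Text_ID   = c(17, 17, 17, 22, 22, 25, 25, 26, 26, 39, 39)
)

# anyDuplicated() returns 0 when a vector has no repeated values.
# ave() applies the check per Text_ID group and recycles the result
# (stored as 1 or 0) to every row of that group.
keep <- ave(
  df$Course_ID, df$Text_ID,
  FUN = function(x) anyDuplicated(x) == 0
) == 1

# Subset to the rows whose group passed the check.
result <- df[keep, ]
result
```

On the sample data this leaves the four rows with Text_ID 22 and 39. Note that, like the dplyr and data.table versions, this condition would also keep a Text_ID that appears in only one row, since a single row has no repeats. The sample data has no such case, but if yours does and you want to drop it, you could additionally require more than one row per group.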