Search code examples
sqlrdplyrsubquerywhere-clause

Selecting rows based on the value of columns in other rows


For this problem I would be happy with a solution either in R (ideally with dplyr but other methods would also be OK) or pure SQL.

I have data consisting for individuals (ID) and email addresses, and a binary indicator representing whether the email address is the individual's primary email address (1) or not (0)

  • all IDs have one and only one primary email address
  • IDs can have several non-primary email addresses (or none)
  • IDs can have the same email address as both primary and non-primary

For example:

   ID Email Primary
1   1     A       1
2   1     A       0
3   1     B       0
4   2     A       1
5   2     A       0
6   3     C       1
7   4     D       1
8   4     C       0
9   5     E       1
10  5     F       0

(The actual dataset has around half a million rows)

I wish to identify IDs where an email address is non-primary, but is primary for a different ID. That is, I want to select rows where:

  • Primary is 0
  • There exists another row where that ID is Primary but for a different ID

Thus in the data above, I want to select row 5 (because the email address is non-primary, but primary in row 1 for a different ID and row 8 (because it is non-primary, but primary in row 6 for a different ID) and row 2

For R users, here is the toy dataframe above:

structure(list(ID = c(1, 1, 1, 2, 2, 3, 4, 4, 5, 5), Email = c("A", "A", "B", "A", "A", "C", "D", "C", "E", "F"), Primary = c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0)), class = "data.frame", row.names = c(NA, -10L))

Solution

  • You can select rows where

    • Primary = 0
    • number of ID's for that Email is greater than 1.
    • There is atleast one primary = 1 for that Email

    Using dplyr, you can do this as :

    library(dplyr)
    
    df %>% 
       group_by(Email) %>% 
       filter(Primary == 0, n_distinct(ID) > 1, any(Primary == 1))
    
    #     ID Email Primary
    #  <dbl> <chr>   <dbl>
    #1     1 A           0
    #2     2 A           0
    #3     4 C           0
    

    Since you have big data a data.table solution would be helpful :

    library(data.table)
    
    setDT(df)[, .SD[Primary == 0 & uniqueN(ID) > 1 & any(Primary == 1)], Email]