Selecting rows based on the value of columns in other rows

For this problem I would be happy with a solution either in R (ideally with dplyr but other methods would also be OK) or pure SQL.

I have data consisting for individuals (ID) and email addresses, and a binary indicator representing whether the email address is the individual's primary email address (1) or not (0)

all IDs have one and only one primary email address
IDs can have several non-primary email addresses (or none)
IDs can have the same email address as both primary and non-primary

For example:

   ID Email Primary
1   1     A       1
2   1     A       0
3   1     B       0
4   2     A       1
5   2     A       0
6   3     C       1
7   4     D       1
8   4     C       0
9   5     E       1
10  5     F       0

(The actual dataset has around half a million rows)

I wish to identify IDs where an email address is non-primary, but is primary for a different ID. That is, I want to select rows where:

Primary is 0
There exists another row where that ID is Primary but for a different ID

Thus in the data above, I want to select row 5 (because the email address is non-primary, but primary in row 1 for a different ID and row 8 (because it is non-primary, but primary in row 6 for a different ID) and row 2

For R users, here is the toy dataframe above:

structure(list(ID = c(1, 1, 1, 2, 2, 3, 4, 4, 5, 5), Email = c("A", "A", "B", "A", "A", "C", "D", "C", "E", "F"), Primary = c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0)), class = "data.frame", row.names = c(NA, -10L))

Solution

You can select rows where

Primary = 0
number of ID's for that Email is greater than 1.
There is atleast one primary = 1 for that Email

Using dplyr, you can do this as :

library(dplyr)

df %>% 
   group_by(Email) %>% 
   filter(Primary == 0, n_distinct(ID) > 1, any(Primary == 1))

#     ID Email Primary
#  <dbl> <chr>   <dbl>
#1     1 A           0
#2     2 A           0
#3     4 C           0

Since you have big data a data.table solution would be helpful :

library(data.table)

setDT(df)[, .SD[Primary == 0 & uniqueN(ID) > 1 & any(Primary == 1)], Email]