Search code examples
rduplicatesunique

Checking if a value appears again by group


I am working in R

sample data:

I have people which have an id which has been assigned from their social security/national insurance number (actual_id). actual_id is assumed to be correct.

This column may not be needed, but I have worked out how many times each person appears in the data (number). e.g. we can see the person with actual_id = 1 appears twice in the data, so number = 2 for this person.

I have created a new method of id-ing people, which has nothing to do with social security/national insurance number. The new method has generated an id which is in the proposed_id column.

df <- data.frame(actual_id = c(1, 1, 2, 2, 2, 3, 3), 
                 proposed_id = c("a", "a", "b", "b", "c", "d", "b"), 
                 number = c(2, 2, 3, 3, 3, 2, 2))

print(df)

actual_id number proposed_id
1 2 a
1 2 a
2 3 b
2 3 b
2 3 c
3 2 d
3 2 b

Desired outcome

I want a way of assessing if the proposed_id is used only once for any actual_id:

  • We can see that proposed_id = a has been allocated to actual_id = 1, and does not appear again in proposed_id column. This would therefore be allocated "TRUE".
  • We can see that proposed_id = b has been allocated to actual_id = 2, but it has then been allocated further down to proposed_id = 3. This would be allocated as "FALSE".

My actual data has 90,000 rows.

actual_id number proposed_id assessment
1 2 a TRUE
1 2 a TRUE
2 3 b FALSE
2 3 b FALSE
2 3 c TRUE
3 2 d TRUE
3 2 b FALSE

Solution

  • base R

    with(df, ave(actual_id, proposed_id, FUN = function(z) length(unique(z)) == 1)) > 0
    # [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE
    

    dplyr

    library(dplyr)
    df %>%
      mutate(assessment = n_distinct(actual_id) == 1, .by = proposed_id)
    #   actual_id proposed_id number assessment
    # 1         1           a      2       TRUE
    # 2         1           a      2       TRUE
    # 3         2           b      3      FALSE
    # 4         2           b      3      FALSE
    # 5         2           c      3       TRUE
    # 6         3           d      2       TRUE
    # 7         3           b      2      FALSE
    

    data.table

    library(data.table)
    as.data.table(df)[, assessment := uniqueN(actual_id) == 1, by = "proposed_id"][]
    #    actual_id proposed_id number assessment
    #        <num>      <char>  <num>     <lgcl>
    # 1:         1           a      2       TRUE
    # 2:         1           a      2       TRUE
    # 3:         2           b      3      FALSE
    # 4:         2           b      3      FALSE
    # 5:         2           c      3       TRUE
    # 6:         3           d      2       TRUE
    # 7:         3           b      2      FALSE
    

    Sample data

    df <- structure(list(actual_id = c(1, 1, 2, 2, 2, 3, 3), proposed_id = c("a", "a", "b", "b", "c", "d", "b"), number = c(2, 2, 3, 3, 3, 2, 2)), class = "data.frame", row.names = c(NA, -7L))