Search code examples
rflags

Flagging an id based on another column has different values in R


I have a flagging rule need to apply.

Here is how my dataset looks like:

    df <- data.frame(id = c(1,1,1,1, 2,2,2,2, 3,3,3,3),
                 key = c("a","a","b","c", "a","b","c","d", "a","b","c","c"),
                 form = c("A","B","A","A", "A","A","A","A", "B","B","B","A"))
    
    > df
   id key form
1   1   a    A
2   1   a    B
3   1   b    A
4   1   c    A
5   2   a    A
6   2   b    A
7   2   c    A
8   2   d    A
9   3   a    B
10  3   b    B
11  3   c    B
12  3   c    A

I would like to flag ids based on a key columns that has duplicates, a third column of form shows different forms for each key. The idea is to understand if an id has taken any items from multiple forms. I need to add a filtering column as below:

> df.1
   id key form type
1   1   a    A multiple
2   1   a    B multiple
3   1   b    A multiple 
4   1   c    A multiple
5   2   a    A single
6   2   b    A single
7   2   c    A single
8   2   d    A single
9   3   a    B multiple
10  3   b    B multiple
11  3   c    B multiple
12  3   c    A multiple

And eventually I need to get rid off the extra duplicated row which has different form. To decide which of the duplicated one drops, I pick whichever the form type has more items.

In a final separate dataset, I would like to have something like below:

  > df.2
       id key form type
    1   1   a    A multiple
    3   1   b    A multiple 
    4   1   c    A multiple
    5   2   a    A single
    6   2   b    A single
    7   2   c    A single
    8   2   d    A single
    9   3   a    B multiple
    10  3   b    B multiple
    11  3   c    B multiple

So first id has form A dominant so kept the A, and the third id has form B dominant so kept the B.

Any ideas? Thanks!


Solution

  • We can check number of distinct elements to create the new column by group and then filter based on the highest frequency (Mode)

    library(dplyr)
    df.2 <- df %>% 
      group_by(id) %>%
      mutate(type = if(n_distinct(form) > 1) 'multiple' else 'single') %>% 
      filter(form == Mode(form)) %>%
      ungroup
    

    -output

    > df.2
    # A tibble: 10 × 4
          id key   form  type    
       <dbl> <chr> <chr> <chr>   
     1     1 a     A     multiple
     2     1 b     A     multiple
     3     1 c     A     multiple
     4     2 a     A     single  
     5     2 b     A     single  
     6     2 c     A     single  
     7     2 d     A     single  
     8     3 a     B     multiple
     9     3 b     B     multiple
    10     3 c     B     multiple
    

    where

    Mode <- function(x) {
      ux <- unique(x)
      ux[which.max(tabulate(match(x, ux)))]
    }