
How to find values common to at least half of the labels in an R tibble


I have a tibble of values:

library(tibble)
raw = tibble(
       labels = rep(rep(1:4, each = 3), 2),
       group = rep(c("A", "B"), each = 12),
       value = c(1,2,3,3,4,5,6,7,2,2,12,1,7,3,3,3,4,5,6,3,2,2,7,1))

I want to select, separately for each group (A and B), the values that occur in at least half of that group's four labels. The expected result would be:

Res = tibble(group = c("A","B"),
       value = c("1,2,3","2,3,7"))

It would be helpful to have a flexible function that can do the same selection with a different threshold, e.g. at least 1/3 of the labels.


Solution

  • Here is one option: group by 'group' and 'value' and count the number of distinct 'labels' each value occurs in ('n'). Then group by 'group' alone and filter the rows where 'n' is greater than or equal to the number of distinct 'labels' in that group divided by 2, i.e. 50%. Finally, get the distinct rows of 'group' and 'value'.

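    library(dplyr)
    raw %>%
       # per (group, value): number of distinct labels the value occurs in
       group_by(group, value) %>%
       mutate(n = n_distinct(labels)) %>%
       # per group: keep values found in at least half of the group's labels
       group_by(group) %>%
       filter(n >= n_distinct(labels)/2) %>%
       select(-n) %>%
       ungroup() %>%
       distinct(group, value)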
    # A tibble: 6 x 2
    #  group value
    #  <chr> <dbl>
    #1 A         1
    #2 A         2
    #3 A         3
    #4 B         7
    #5 B         3
    #6 B         2
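
The question also asks for a flexible version that works with other thresholds such as 1/3. Here is a minimal sketch of how the same steps could be wrapped in a function. The name `common_values` and the argument `prop` are made up for illustration and are not from the answer above; the column names 'labels', 'group' and 'value' are assumed to match the question's tibble, and `.groups = "drop"` assumes dplyr 1.0.0 or later. It also collapses the values into one comma-separated string per group, as in the expected `Res`.

```r
library(dplyr)

# Hypothetical helper: keep the values that occur in at least `prop`
# (a fraction between 0 and 1) of each group's distinct labels.
common_values <- function(data, prop = 0.5) {
  data %>%
    # number of distinct labels in each group
    group_by(group) %>%
    mutate(n_labels = n_distinct(labels)) %>%
    # number of distinct labels each value occurs in, per group
    group_by(group, value) %>%
    summarise(n = n_distinct(labels),
              n_labels = first(n_labels),
              .groups = "drop") %>%
    # apply the threshold
    filter(n >= prop * n_labels) %>%
    # one row per group, values sorted and collapsed into a string
    group_by(group) %>%
    summarise(value = paste(sort(value), collapse = ","),
              .groups = "drop")
}
```

With the `raw` tibble from the question, `common_values(raw)` should give "1,2,3" for group A and "2,3,7" for group B. `common_values(raw, prop = 1/3)` gives the same result here, because with four labels per group a value still has to occur in at least two of them.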