r, dplyr, unique, distinct

Count unique strings that only occur in a single group based on all possible groups


I have the following data frame:

a = data.frame(PA = c("A", "A", "A", "B", "B"), Family = c("aa", "ab", "ac", "aa", "ad"))

What I want is a count of the unique 'Family' strings (aa, ab, ac, ad) in each PA (A or B), counting only the strings that occur in a single PA. For example, aa occurs in both A and B, so I don't want it counted. On the other hand, ab and ac occur only in PA A: that's what I want.

Using dplyr I was doing something like:

a %>% group_by(PA) %>%
summarise(count_family = n_distinct(Family))

But this only returns the number of distinct Family values inside each PA, whereas I want to count the Family values that occur in exactly one PA across all PAs.


Solution

  • Here's a tidyverse approach.

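    First keep only the Family values that occur exactly once in the whole data frame (this drops every Family shared between PAs), then group_by(PA) and count. Note that this assumes each PA/Family pair appears at most once in the data.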

    library(tidyverse)
    
    a %>%
      group_by(Family) %>%
      filter(n() == 1) %>%   # keep Family values that appear in exactly one row
      group_by(PA) %>%       # regroup by PA (replaces the Family grouping)
      summarize(count_family = n())
    

    Output

    # A tibble: 2 x 2
      PA    count_family
      <chr>        <int>
    1 A                2
    2 B                1
    

    Intermediate result after filter(), before regrouping by PA and calling summarize()

    # A tibble: 3 x 2
    # Groups:   Family [3]
      PA    Family
      <chr> <chr> 
    1 A     ab    
    2 A     ac    
    3 B     ad
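
If the real data can contain the same Family more than once within a single PA, filter(n() == 1) would drop those rows even though the Family belongs to only one PA. A minimal sketch of a variant that handles this case, assuming a hypothetical data frame a2 (not from the question) in which "ab" appears twice in PA "A", could filter on the number of distinct PAs per Family instead:

```r
library(dplyr)

# Hypothetical variation of the question's data: "ab" is repeated within PA "A"
a2 <- data.frame(
  PA     = c("A", "A", "A", "A", "B", "B"),
  Family = c("aa", "ab", "ab", "ac", "aa", "ad")
)

res <- a2 %>%
  group_by(Family) %>%
  filter(n_distinct(PA) == 1) %>%               # keep families found in exactly one PA
  group_by(PA) %>%
  summarise(count_family = n_distinct(Family))  # count each family once per PA

res
```

Here aa is still excluded because it occurs in both PAs, while ab is kept and counted once for A. On the original data frame a, this variant should give the same counts as the approach above.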