Search code examples
rdplyrunique

Calculate number of unique values in grouped matrix


I have a grouped data set that looks like this:

data = data.frame(group = c(1,1,1,1,2,2,2,2), 
                  c1 = c("A", "E", "A", "J", "L", "M", "L", "J"), 
                  c2 = c("B", "F", "F", "K", "B", "F", "T", "E"), 
                  c3 = c("C", "G", "C", "L", "C", "X", "C", "V"), 
                  c4 = c("D", "H", "I", "M", "D", "T", "I", "W"))

And I need to calculate the number of values in each row that are not duplicated within each group. For example, something that looks like this:

    group c1 c2 c3 c4 uniq.vals
1     1  A  B  C  D         2
2     1  E  F  G  H         3
3     1  A  F  C  I         1
4     1  J  K  L  M         4
5     2  L  B  C  D         2
6     2  M  F  X  T         3
7     2  L  T  C  I         1
8     2  J  E  V  W         4

The count for row 1 would be 2, because B and D do not show up in any of the other rows within group 1.

I am familiar with using group_by and summarize but I am having trouble extending that to this particular situation, which requires that each value be checked across multiple columns and rows. For example, n_distinct on its own would not work because I'm looking for non-duplicated values, not unique values.

Ideally the solution would also ignore NAs and not count them as duplicated or non-duplicated values.


Solution

  • Here is an option with tidyverse. Reshape to 'long' format with pivot_longer, grouped by 'group', replace all the duplicate 'value' to NA, then grouped by row number, summarise to get the counts with n_distinct (number of distinct elements), and bind with the original data

    library(dplyr)
    library(tidyr)
    data %>%
        mutate(rn = row_number()) %>%
        pivot_longer(cols = starts_with('c')) %>% 
        group_by(group) %>%
        mutate(value = replace(value, duplicated(value)|duplicated(value,
         fromLast = TRUE), NA)) %>%
        group_by(rn) %>%
        summarise(uniq.vals = n_distinct(value, na.rm = TRUE), .groups = 'drop') %>%
        select(uniq.vals) %>%
        bind_cols(data, .)
    

    -output

    #   group c1 c2 c3 c4 uniq.vals
    #1     1  A  B  C  D         2
    #2     1  E  F  G  H         3
    #3     1  A  F  C  I         1
    #4     1  J  K  L  M         4
    #5     2  L  B  C  D         2
    #6     2  M  F  X  T         3
    #7     2  L  T  C  I         1
    #8     2  J  E  V  W         4