Search code examples
rduplicatesfrequency

Calculate the frequency of duplicated rows, based on specific columns, but keeping the ID of one of the duplicated rows in R


I am trying to calculate the frequency of duplicated rows based on specific columns but i want to keep the id of each one of the duplicated rows because i will need it after to merge other datasets.

Here is my sample data

############
## Sample ##
############

ID=seq(from=1,to=12,by=1)
var1=c(rep("a",12))
var2=c(rep("b",12))
var3=c("c","c","b","d","e","f","g","h","i","j","k","k")
df=data.frame(ID,var1,var2,var3)

df
   ID var1 var2 var3
1   1    a    b    c
2   2    a    b    c
3   3    a    b    b
4   4    a    b    d
5   5    a    b    e
6   6    a    b    f
7   7    a    b    g
8   8    a    b    h
9   9    a    b    i
10 10    a    b    j
11 11    a    b    k
12 12    a    b    k

Here is my function

freq.f<- function(data){
  vari=colnames(data[2:ncol(data)])
  data  %>%     
    dplyr:: count(!!! rlang::syms(vari))  %>%
    mutate(frequency = n/sum(n))
  
}

Here is my output

freq.f(data=df)

   var1 var2 var3 n  frequency
1     a    b    b 1 0.08333333
2     a    b    c 2 0.16666667
3     a    b    d 1 0.08333333
4     a    b    e 1 0.08333333
5     a    b    f 1 0.08333333
6     a    b    g 1 0.08333333
7     a    b    h 1 0.08333333
8     a    b    i 1 0.08333333
9     a    b    j 1 0.08333333
10    a    b    k 2 0.16666667

As you can see i have the set a,b,c duplicated 2 times corresponding to the ID 1 and 2. What i would like to have is the a,b,c with ID = 1, the same for the set a,b,k. So the desired output would be like

# desired output

   ID   var1 var2 var3 n  frequency
1  3     a    b    b   1  0.08333333
2  1     a    b    c   2  0.16666667
3  4     a    b    d   1  0.08333333
4  5     a    b    e   1  0.08333333
5  6     a    b    f   1  0.08333333
6  7     a    b    g   1  0.08333333
7  8     a    b    h   1  0.08333333
8  9     a    b    i   1  0.08333333
9  10    a    b    j   1  0.08333333
10 11    a    b    k   2  0.16666667

Thank you in advance for your help.


Solution

  • We could mutate to create the count and then filter or slice or use distinct

    library(dplyr)
    df %>% 
         group_by(var1, var2, var3) %>%
         mutate(n = n()) %>%
         ungroup %>% 
         distinct(var1, var2, var3, .keep_all = TRUE) %>% 
         mutate(frequency = n/sum(n))
    

    -output

    # A tibble: 10 x 6
    #      ID var1  var2  var3      n frequency
    #   <dbl> <chr> <chr> <chr> <int>     <dbl>
    # 1     1 a     b     c         2    0.167 
    # 2     3 a     b     b         1    0.0833
    # 3     4 a     b     d         1    0.0833
    # 4     5 a     b     e         1    0.0833
    # 5     6 a     b     f         1    0.0833
    # 6     7 a     b     g         1    0.0833
    # 7     8 a     b     h         1    0.0833
    # 8     9 a     b     i         1    0.0833
    # 9    10 a     b     j         1    0.0833
    #10    11 a     b     k         2    0.167 
    

    Or make it compact with add_count

    df %>%
        add_count(var1, var2, var3) %>% 
        distinct(var1, var2, var3, .keep_all = TRUE) %>%
        mutate(frequency = n/sum(n))
    

    Or if we use the count, then do a right_join with the original data and then use distinct

    df %>% 
        count(var1, var2, var3) %>% 
        mutate(frequency = n/sum(n)) %>% 
        right_join(df) %>% 
        distinct(var1, var2, var3, .keep_all = TRUE)