
Calculate the frequency of identical rows while keeping the data frame the same size


I have a data frame with repeated rows, and I have a function that calculates the frequency of identical rows. Here is my sample:

#############
###Sample####
#############

ID=seq(from=1,to=12,by=1)
var1=c(rep("a",12))
var2=c(rep("b",12))
var3=c("c","c","b","d","e","f","g","h","i","j","k","k")
df=data.frame(ID,var1,var2,var3)

   ID var1 var2 var3
1   1    a    b    c
2   2    a    b    c
3   3    a    b    b
4   4    a    b    d
5   5    a    b    e
6   6    a    b    f
7   7    a    b    g
8   8    a    b    h
9   9    a    b    i
10 10    a    b    j
11 11    a    b    k
12 12    a    b    k

###############
# function ####
###############

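library(dplyr)  # provides %>%, count() and mutate()

freq.f <- function(data) {
  # every column except the first (ID) defines which rows count as identical
  vari <- colnames(data)[2:ncol(data)]
  data %>%
    dplyr::count(!!!rlang::syms(vari)) %>%
    mutate(frequency = n / sum(n))
}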

# current output
freq.f(df)
   var1 var2 var3 n  frequency
1     a    b    b 1 0.08333333
2     a    b    c 2 0.16666667
3     a    b    d 1 0.08333333
4     a    b    e 1 0.08333333
5     a    b    f 1 0.08333333
6     a    b    g 1 0.08333333
7     a    b    h 1 0.08333333
8     a    b    i 1 0.08333333
9     a    b    j 1 0.08333333
10    a    b    k 2 0.16666667

What I want is to calculate this frequency while keeping all of my records, because each ID is a different person even when the rest of the row is the same. I also want the ID to appear in the output so that I can keep track of the individuals. So the desired output is:

# desired output

   ID var1 var2 var3 n  freq
1   1    a    b    c 2  0.16666667
2   2    a    b    c 2  0.16666667
3   3    a    b    b 1  0.08333333
4   4    a    b    d 1  0.08333333
5   5    a    b    e 1  0.08333333
6   6    a    b    f 1  0.08333333
7   7    a    b    g 1  0.08333333
8   8    a    b    h 1  0.08333333
9   9    a    b    i 1  0.08333333
10 10    a    b    j 1  0.08333333
11 11    a    b    k 2  0.16666667
12 12    a    b    k 2  0.16666667

I have looked through almost every post on SO about frequency but cannot find an answer. Thank you in advance for your help.


Solution

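  • Adding a join within your function gives the expected result: the per-combination counts are joined back onto the original data on the shared columns (var1, var2, var3), so every ID keeps its own row.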

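    library(dplyr)

    freq.f <- function(data) {
      vari <- colnames(data)[2:ncol(data)]
      # count each distinct combination of the non-ID columns
      freqs <- data %>%
        dplyr::count(!!!rlang::syms(vari)) %>%
        mutate(frequency = n / sum(n))
      inner_join(data, freqs, by = vari)  # this join is the new step
    }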
    freq.f(df)
    
       ID var1 var2 var3 n  frequency
    1   1    a    b    c 2 0.16666667
    2   2    a    b    c 2 0.16666667
    3   3    a    b    b 1 0.08333333
    4   4    a    b    d 1 0.08333333
    5   5    a    b    e 1 0.08333333
    6   6    a    b    f 1 0.08333333
    7   7    a    b    g 1 0.08333333
    8   8    a    b    h 1 0.08333333
    9   9    a    b    i 1 0.08333333
    10 10    a    b    j 1 0.08333333
    11 11    a    b    k 2 0.16666667
    12 12    a    b    k 2 0.16666667
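
As a possible alternative to the join, dplyr's add_count() appends the group size to every row instead of collapsing the groups, so the IDs survive without a second step. The sketch below is only an illustration under a few assumptions: the function name freq_keep_rows is made up for this example, the first column is taken to be the ID, and the frequency is divided by nrow(data) rather than sum(n), because after add_count() the n column repeats for each duplicate and no longer sums to the number of rows.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  ID   = 1:12,
  var1 = rep("a", 12),
  var2 = rep("b", 12),
  var3 = c("c", "c", "b", "d", "e", "f", "g", "h", "i", "j", "k", "k")
)

# Hypothetical helper name, not from the original post
freq_keep_rows <- function(data) {
  # assumption: column 1 is the ID, all remaining columns define a "similar" row
  vari <- colnames(data)[-1]
  data %>%
    # add_count() keeps every row and adds the size of its group as column n
    add_count(!!!rlang::syms(vari), name = "n") %>%
    # divide by the total number of rows, not sum(n), since n is repeated per duplicate
    mutate(frequency = n / nrow(data))
}

result <- freq_keep_rows(df)
print(result)
```

The row order of the input is preserved, so the result lines up with the original IDs in the same way as the join-based version.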