I am trying to calculate the frequency of duplicated rows based on specific columns but i want to keep the id of each one of the duplicated rows because i will need it after to merge other datasets.
Here is my sample data
############
## Sample ##
############
ID=seq(from=1,to=12,by=1)
var1=c(rep("a",12))
var2=c(rep("b",12))
var3=c("c","c","b","d","e","f","g","h","i","j","k","k")
df=data.frame(ID,var1,var2,var3)
df
ID var1 var2 var3
1 1 a b c
2 2 a b c
3 3 a b b
4 4 a b d
5 5 a b e
6 6 a b f
7 7 a b g
8 8 a b h
9 9 a b i
10 10 a b j
11 11 a b k
12 12 a b k
Here is my function
freq.f<- function(data){
vari=colnames(data[2:ncol(data)])
data %>%
dplyr:: count(!!! rlang::syms(vari)) %>%
mutate(frequency = n/sum(n))
}
Here is my output
freq.f(data=df)
var1 var2 var3 n frequency
1 a b b 1 0.08333333
2 a b c 2 0.16666667
3 a b d 1 0.08333333
4 a b e 1 0.08333333
5 a b f 1 0.08333333
6 a b g 1 0.08333333
7 a b h 1 0.08333333
8 a b i 1 0.08333333
9 a b j 1 0.08333333
10 a b k 2 0.16666667
As you can see i have the set a,b,c
duplicated 2 times corresponding to the ID
1 and 2. What i would like to have is the a,b,c
with ID = 1
, the same for the set a,b,k
. So the desired output would be like
# desired output
ID var1 var2 var3 n frequency
1 3 a b b 1 0.08333333
2 1 a b c 2 0.16666667
3 4 a b d 1 0.08333333
4 5 a b e 1 0.08333333
5 6 a b f 1 0.08333333
6 7 a b g 1 0.08333333
7 8 a b h 1 0.08333333
8 9 a b i 1 0.08333333
9 10 a b j 1 0.08333333
10 11 a b k 2 0.16666667
Thank you in advance for your help.
We could mutate
to create the count and then filter
or slice
or use distinct
library(dplyr)
df %>%
group_by(var1, var2, var3) %>%
mutate(n = n()) %>%
ungroup %>%
distinct(var1, var2, var3, .keep_all = TRUE) %>%
mutate(frequency = n/sum(n))
-output
# A tibble: 10 x 6
# ID var1 var2 var3 n frequency
# <dbl> <chr> <chr> <chr> <int> <dbl>
# 1 1 a b c 2 0.167
# 2 3 a b b 1 0.0833
# 3 4 a b d 1 0.0833
# 4 5 a b e 1 0.0833
# 5 6 a b f 1 0.0833
# 6 7 a b g 1 0.0833
# 7 8 a b h 1 0.0833
# 8 9 a b i 1 0.0833
# 9 10 a b j 1 0.0833
#10 11 a b k 2 0.167
Or make it compact with add_count
df %>%
add_count(var1, var2, var3) %>%
distinct(var1, var2, var3, .keep_all = TRUE) %>%
mutate(frequency = n/sum(n))
Or if we use the count
, then do a right_join
with the original data and then use distinct
df %>%
count(var1, var2, var3) %>%
mutate(frequency = n/sum(n)) %>%
right_join(df) %>%
distinct(var1, var2, var3, .keep_all = TRUE)