I have a dataset that looks like this:
x <- data.table(id=c(1,1,1,2,2,3,4,4,4,4), cl=c("a","b","c","b","b","a","a","b","c","a"))
I am trying to find the probability of a row getting picked for each group (id) based on the elements in cl.
I tried the following:
x[,num:=.N, keyby=.(id,cl)]
x[,den:=.N, keyby=.(id)]
x[,prob:=num/den, ]
Is there a better way to do this?
Ultimately, my goal is to use these probability values as weights when sampling one row per group (id). Any better alternatives for arriving at these weights would be greatly appreciated.
Do you mean something like this?
> x[, prob := prop.table(table(cl))[cl], id][]
id cl prob
1: 1 a 0.3333333
2: 1 b 0.3333333
3: 1 c 0.3333333
4: 2 b 1.0000000
5: 2 b 1.0000000
6: 3 a 1.0000000
7: 4 a 0.5000000
8: 4 b 0.2500000
9: 4 c 0.2500000
10: 4 a 0.5000000
or
> unique(x[, prob := prop.table(table(cl))[cl], id][])
id cl prob
1: 1 a 0.3333333
2: 1 b 0.3333333
3: 1 c 0.3333333
4: 2 b 1.0000000
5: 3 a 1.0000000
6: 4 a 0.5000000
7: 4 b 0.2500000
8: 4 c 0.2500000
Explanation: table + prop.table gives the table of relative frequencies of the elements of cl within each id group. Its entries are named by those elements, so we use [cl] to look up the frequency that belongs to each row.
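To make the named lookup concrete, and to connect it to the stated goal of sampling one row per id, here is a minimal sketch. It assumes the same x as in the question; the names cl4, p4, w4 and picked are made up for illustration, the seed is arbitrary, and the .N-based computation of prob is just one possible alternative, not necessarily better than the table approach above.

```r
library(data.table)

x <- data.table(id = c(1,1,1,2,2,3,4,4,4,4),
                cl = c("a","b","c","b","b","a","a","b","c","a"))

# Named lookup for a single group (id == 4)
cl4 <- x[id == 4, cl]           # "a" "b" "c" "a"
p4  <- prop.table(table(cl4))   # relative frequencies, named a, b, c
w4  <- as.numeric(p4[cl4])      # one weight per row, looked up by name

# Same weights for every group using only .N.
# as.numeric() keeps prob a double column, so the grouped division
# in the second step is not coerced back to integer.
x[, prob := as.numeric(.N), by = .(id, cl)][, prob := prob / .N, by = id]

# Draw one row per id, using prob as sampling weights.
# sample(.N, 1, ...) draws a row index, which also works for one-row groups.
set.seed(1)  # arbitrary seed, only for reproducibility
picked <- x[, .SD[sample(.N, 1, prob = prob)], by = id]
```

Note that sample() rescales the weights within each group, and every row carries its own weight. In id 4 the two "a" rows each have weight 0.5, so an "a" row is drawn with probability 1/1.5 = 2/3 rather than 1/2. If that is not what you want, sampling from the unique() version of the table may be closer to the intent.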