r, data.table, counting, resampling, summarize

Count the occurrences of an element within a group without summarizing


I have a dataset that looks like this:

x <- data.table(id=c(1,1,1,2,2,3,4,4,4,4), cl=c("a","b","c","b","b","a","a","b","c","a"))

For each group (id), I am trying to find the probability of a row being picked, based on how often its cl value occurs in that group.

I tried the following:

x[,num:=.N, keyby=.(id,cl)]

x[,den:=.N, keyby=.(id)]

x[,prob:=num/den, ]

Is there a better way to do this?

Ultimately, my goal is to use these probabilities as weights when sampling one row per group (id). Any better way to arrive at these weights would be greatly appreciated.
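
For reference, the sampling step described above might look roughly like the following. This is only a sketch under assumptions: the name `picked`, the seed, and the use of `sample.int` with `.I` are illustrative choices, not something I have settled on.

```r
library(data.table)

x <- data.table(id = c(1,1,1,2,2,3,4,4,4,4),
                cl = c("a","b","c","b","b","a","a","b","c","a"))

# Weights as computed above: rows per (id, cl) divided by rows per id
x[, num := .N, by = .(id, cl)]
x[, den := .N, by = id]
x[, prob := num / den]

set.seed(1)  # arbitrary seed, only to make the draw reproducible

# For each id, draw one position within the group using prob as weights;
# .I maps that position back to a row number of x
picked <- x[, .(row = .I[sample.int(.N, 1L, prob = prob)]), by = id]

x[picked$row]  # one sampled row per id
```

The weights within a group need not sum to one, since the sampling function rescales them.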


Solution

  • Do you mean something like this?

    > x[, prob := prop.table(table(cl))[cl], id][]
        id cl      prob
     1:  1  a 0.3333333
     2:  1  b 0.3333333
     3:  1  c 0.3333333
     4:  2  b 1.0000000
     5:  2  b 1.0000000
     6:  3  a 1.0000000
     7:  4  a 0.5000000
     8:  4  b 0.2500000
     9:  4  c 0.2500000
    10:  4  a 0.5000000
    

    or, if you want only one row per distinct (id, cl) pair:

    > unique(x[, prob := prop.table(table(cl))[cl], id][])
       id cl      prob
    1:  1  a 0.3333333
    2:  1  b 0.3333333
    3:  1  c 0.3333333
    4:  2  b 1.0000000
    5:  3  a 1.0000000
    6:  4  a 0.5000000
    7:  4  b 0.2500000
    8:  4  c 0.2500000
    

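    Explanation: within each id, table counts how often each value of cl occurs and prop.table converts those counts to relative frequencies. The result is named by the values of cl, so indexing it with [cl] returns the relative frequency for each row's cl value.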
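
To see the indexing step in isolation, here is a small base R sketch using only the cl values of group id 4 from the question. The variable name `freq` is just for illustration.

```r
cl <- c("a", "b", "c", "a")     # the cl values of the rows with id == 4

freq <- prop.table(table(cl))   # relative frequencies named "a", "b", "c"
freq[cl]                        # look up by name: one value per element of cl

as.numeric(freq[cl])            # 0.50 0.25 0.25 0.50, matching rows 7 to 10 above
```

Because the lookup is by name, the result has the same length and order as cl, which is what lets `:=` assign it back row by row within each group.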