Tags: r, group-by, data.table, probability, k-means

How to calculate the probability value for each observation when distances are known (k-means, R)


I am new to R programming and am trying to figure out the following. The table below contains the Euclidean distance and cluster for each observation. There are more than 100,000 distinct ids in the table, and for each id a distance has been calculated to each cluster. There are 6 clusters, named 1-6.

I need to calculate the final column, which should be the probability of each observation belonging to that cluster. For the first entry, this is given by

p1 = 1 / ((d1/d1)^2 + (d1/d2)^2 + ... + (d1/d6)^2)

where the denominator sums over the 6 distances for that id.

Each probability value is calculated from the 6 distance values for that id. The table is a data.table. I wanted to try something like this, but I don't even know how to complete the line:

dt_calc[, prob_value := (1 / (distance/dt_calc[distance, by = .(id, cluster== 1 )]) ^ 2), by = id]

id  cluster  distance  prob_value
1   1        d1        p1
1   2        d2        ?
1   3        d3        ----
1   4        d4        ----
1   5        d5        ----
1   6        d6        ----
2   1        d7        ----
2   2        d8        ----
2   3        d9        ----
2   4        d10       ----
2   5        d11       ----
2   6        d12       ----

Can someone please show me how to calculate this prob_value column?


Solution

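  • I'm not sure how efficient by = .EACHI is here, but this seems to work. The self-join on id pairs each row's distance (i.distance) with all six distances for the same id. I couldn't figure out why things go wrong when assigning by reference, so I put the result in a new data.table instead, but this might at least get you somewhere.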

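    library(data.table)

    # Toy data: 2 ids, 6 clusters each, random integer distances in 1..100
    dt   <- data.table(id = rep(c(1, 2), each = 6),
                       cluster = rep(1:6, 2),
                       distance = sample(100, size = 12, replace = TRUE))

    # Self-join on id: each row's i.distance is divided by that id's six distances
    test <- dt[dt, .(prob_value = 1/sum((i.distance/distance) ^ 2)),
               on = .(id), by = .EACHI]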
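
As a possible alternative to the self-join, here is a minimal sketch that assigns by reference with a plain by = id grouping. It is not the answerer's method. It relies on the identity 1 / sum((d_j/d_k)^2) = (1/d_j^2) / sum(1/d_k^2), and it assumes every distance is strictly positive, since a zero distance would give Inf or NaN. The set.seed() call is an addition used only to make the toy data reproducible.

```r
library(data.table)

set.seed(42)  # hypothetical seed, only so the toy data is reproducible
dt <- data.table(id = rep(c(1, 2), each = 6),
                 cluster = rep(1:6, 2),
                 distance = sample(100, size = 12, replace = TRUE))

# Within each id, 1 / sum((d_j / d_k)^2) equals (1 / d_j^2) / sum(1 / d_k^2),
# so the grouped sum is computed once per id and recycled over its rows.
# Assumption: all distances are > 0.
dt[, prob_value := (1 / distance^2) / sum(1 / distance^2), by = id]

# Sanity check: the six values for each id should sum to 1.
dt[, .(total = sum(prob_value)), by = id]
```

Because the grouped sum is evaluated once per id rather than once per row, this may also scale better to the 100,000+ ids mentioned in the question, though that has not been benchmarked here.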