I have a similarity matrix between all cases and, in a separate data frame, the classes of these cases. I want to compute the average similarity between cases from the same class. Here is the equation for a case n from class j:
$$\bar{P}(n) = \sum_{k:\ \mathrm{cl}(k) = j} \mathrm{prox}^2(n, k)$$
That is, we sum the squared proximities between n and all cases k that come from the same class as n. Link: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#outliers
I implemented this with two for loops, but it is really slow. Is there a faster way to do this in R?
Thanks.
//DATA (dput)
Data frame with classes:
structure(list(class = structure(c(1L, 2L, 2L, 1L, 3L, 3L, 1L,
1L, 2L, 3L), .Label = c("1", "2", "3", "5", "6", "7"), class = "factor")), .Names = "class", row.names = c(NA,
-10L), class = "data.frame")
Proximity matrix (row m and column m correspond to the class in row m of the data frame above):
structure(c(1, 0.60996875, 0.51775, 0.70571875, 0.581375, 0.42578125,
0.6595, 0.7134375, 0.645375, 0.468875, 0.60996875, 1, 0.77021875,
0.55171875, 0.540375, 0.53084375, 0.4943125, 0.462625, 0.7910625,
0.56321875, 0.51775, 0.77021875, 1, 0.451375, 0.60353125, 0.62353125,
0.5203125, 0.43934375, 0.6909375, 0.57159375, 0.70571875, 0.55171875,
0.451375, 1, 0.69196875, 0.59390625, 0.660375, 0.76834375, 0.606875,
0.65834375, 0.581375, 0.540375, 0.60353125, 0.69196875, 1, 0.7194375,
0.684, 0.68090625, 0.50553125, 0.60234375, 0.42578125, 0.53084375,
0.62353125, 0.59390625, 0.7194375, 1, 0.53665625, 0.553125, 0.513,
0.801625, 0.6595, 0.4943125, 0.5203125, 0.660375, 0.684, 0.53665625,
1, 0.8456875, 0.52878125, 0.65303125, 0.7134375, 0.462625, 0.43934375,
0.76834375, 0.68090625, 0.553125, 0.8456875, 1, 0.503, 0.6215,
0.645375, 0.7910625, 0.6909375, 0.606875, 0.50553125, 0.513,
0.52878125, 0.503, 1, 0.60653125, 0.468875, 0.56321875, 0.57159375,
0.65834375, 0.60234375, 0.801625, 0.65303125, 0.6215, 0.60653125,
1), .Dim = c(10L, 10L))
Correct result:
c(2.44197227050781, 2.21901680175781, 2.07063155175781, 2.52448621289062,
1.88040830957031, 2.16019295703125, 2.58622273828125, 2.81453253222656,
2.1031745078125, 2.00542063378906)
Should be possible. Your notation does not make clear whether members of like classes are found in the rows or in the columns, so this answer assumes the columns; the obvious modification works just as well if they are in the rows.
colSums(mat^2)  # in R, ^2 is applied element-wise; it is not matrix multiplication
Since both operations are vectorized it would be expected to be much faster than for-loops.
With the modification that restricts each sum to cases of the same class, and assuming the matrix is named 'mat' and the class data frame is named 'cldf':
sapply(seq_len(nrow(mat)),
       function(r) sum(mat[r, cldf[['class']][r] == cldf[['class']]]^2))
[1] 2.441972 2.219017 2.070632 2.524486 1.880408 2.160193 2.586223 2.814533 2.103175 2.005421
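As a loop-free alternative, here is a minimal sketch that builds a logical same-class mask with outer() and multiplies it into the squared matrix before taking row sums. It uses a small made-up 4x4 matrix and made-up class labels; the names toy_mat, toy_cl and same_class are hypothetical and not from the question. With the real data, toy_mat would be 'mat' and toy_cl would be as.character(cldf[['class']]).

```r
# Toy symmetric proximity matrix (illustrative values, not the question's data)
toy_mat <- matrix(c(1.0, 0.5, 0.2, 0.1,
                    0.5, 1.0, 0.3, 0.4,
                    0.2, 0.3, 1.0, 0.6,
                    0.1, 0.4, 0.6, 1.0),
                  nrow = 4, byrow = TRUE)

# Hypothetical class labels, one per row/column of toy_mat
toy_cl <- c("a", "b", "a", "b")

# same_class[r, k] is TRUE when cases r and k share a class
same_class <- outer(toy_cl, toy_cl, "==")

# Zero out squared proximities across classes, then sum each row
masked <- rowSums(toy_mat^2 * same_class)

# The sapply approach from above, applied to the toy data for comparison
looped <- sapply(seq_len(nrow(toy_mat)),
                 function(r) sum(toy_mat[r, toy_cl[r] == toy_cl]^2))

print(masked)
print(isTRUE(all.equal(masked, looped)))
```

The mask is an n x n logical matrix, so its memory use grows quadratically with the number of cases; for a very large proximity matrix the sapply version may be the safer choice.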