I have a very large (millions of points) connected graph and many potential segmentation algorithms to determine group membership. Is there an existing implementation in the sets package, or a similar R package, for calculating a consensus set among the possible ensembles?
An example:
Let's say I had 10 total points and three algorithms that choose groups and members.
> algorithm1<-list(c(1,2,3),c(4,5,6),c(7,8,9,10))
> algorithm2<-list(c(1,2,3),c(4,6),c(5,7,8,9,10))
> algorithm3<-list(c(1,2,3),c(4,6),c(5,7,8),c(9,10))
> algorithm1
[[1]]
[1] 1 2 3
[[2]]
[1] 4 5 6
[[3]]
[1] 7 8 9 10
> algorithm2
[[1]]
[1] 1 2 3
[[2]]
[1] 4 6
[[3]]
[1] 5 7 8 9 10
> algorithm3
[[1]]
[1] 1 2 3
[[2]]
[1] 4 6
[[3]]
[1] 5 7 8
[[4]]
[1] 9 10
All three algorithms agree that 1, 2, and 3 belong together, but the remaining groups need a majority-rule algorithm to decide on the minimum number of groups that minimizes the loss compared to the input groups. This feels like an area of permutation/combinatorics that has probably been solved already. It is not my area, so I just need a push in the right direction.
One incomplete idea I've considered is to generate pairwise links among members, with the link strength equal to the number of times that pair of points appears together in a set.
> library(reshape2)
> library(dplyr) #for bind_rows()
>
> pairwise_count<-function(x){
+
+ #For each group, get all pairwise combination of members
+ m<-lapply(x,function(y){
+ as.data.frame(t(combn(y,2)))
+ })
+
+ #Bind groups into a single dataframe
+ df<-bind_rows(m)
+ colnames(df)<-c("Point1","Point2")
+ return(df)
+ }
>
> #Example
> pairwise_count(algorithm1)
Point1 Point2
1 1 2
2 1 3
3 2 3
4 4 5
5 4 6
6 5 6
7 7 8
8 7 9
9 7 10
10 8 9
11 8 10
12 9 10
> #Compute for all algorithms
> alldf<-list(algorithm1=pairwise_count(algorithm1),algorithm2=pairwise_count(algorithm2),algorithm3=pairwise_count(algorithm3))
> alldf<-melt(alldf,id.vars=c("Point1","Point2"))
>
> #Get consensus probability that a pair are in the same set.
> library(dplyr)
> alldf %>% group_by(Point1,Point2) %>% summarize(n=n()/3)
# A tibble: 16 x 3
# Groups: Point1 [?]
Point1 Point2 n
<dbl> <dbl> <dbl>
1 1. 2. 1.00
2 1. 3. 1.00
3 2. 3. 1.00
4 4. 5. 0.333
5 4. 6. 1.00
6 5. 6. 0.333
7 5. 7. 0.667
8 5. 8. 0.667
9 5. 9. 0.333
10 5. 10. 0.333
11 7. 8. 1.00
12 7. 9. 0.667
13 7. 10. 0.667
14 8. 9. 0.667
15 8. 10. 0.667
16 9. 10. 1.00
>
> # How to choose final sets?
Edit #1: The following code reproduces the example above.
library(reshape2)
library(dplyr)
algorithm1<-list(c(1,2,3),c(4,5,6),c(7,8,9,10))
algorithm2<-list(c(1,2,3),c(4,6),c(5,7,8,9,10))
algorithm3<-list(c(1,2,3),c(4,6),c(5,7,8),c(9,10))
pairwise_count<-function(x){
#For each group, get all pairwise combination of members
m<-lapply(x,function(y){
as.data.frame(t(combn(y,2)))
})
#Bind groups into a single dataframe
df<-bind_rows(m)
colnames(df)<-c("Point1","Point2")
return(df)
}
#Example
pairwise_count(algorithm1)
#Compute for all algorithms
alldf<-list(algorithm1=pairwise_count(algorithm1),algorithm2=pairwise_count(algorithm2),algorithm3=pairwise_count(algorithm3))
alldf<-melt(alldf,id.vars=c("Point1","Point2"))
#Get consensus probability that a pair are in the same set.
alldf %>% group_by(Point1,Point2) %>% summarize(n=n()/3)
# How to choose final sets?
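One way to answer that last question (a sketch, not a vetted method): keep only the pairwise links that a strict majority of algorithms agree on, then take the connected components of the resulting graph as the final sets. A minimal base-R version, where consensus_sets() is a name I made up (not from any package) and points are assumed to be labeled 1..n:

```r
# Sketch: majority-rule consensus via co-membership counts plus
# connected components, in base R only. Assumes points are 1..n.
consensus_sets <- function(algorithms, n) {
  k <- length(algorithms)

  # M[i, j] = number of algorithms placing points i and j in the same group
  M <- matrix(0, n, n)
  for (alg in algorithms) {
    for (grp in alg) {
      M[grp, grp] <- M[grp, grp] + 1
    }
  }

  # Keep a link only where a strict majority of algorithms agree
  A <- M > k / 2
  diag(A) <- TRUE

  # Connected components: repeatedly give each point the minimum
  # label among its neighbors until nothing changes
  labels <- seq_len(n)
  repeat {
    new_labels <- apply(A, 1, function(row) min(labels[row]))
    if (identical(new_labels, labels)) break
    labels <- new_labels
  }
  unname(split(seq_len(n), labels))
}

algorithm1 <- list(c(1,2,3), c(4,5,6), c(7,8,9,10))
algorithm2 <- list(c(1,2,3), c(4,6), c(5,7,8,9,10))
algorithm3 <- list(c(1,2,3), c(4,6), c(5,7,8), c(9,10))

consensus_sets(list(algorithm1, algorithm2, algorithm3), n = 10)
# Majority links yield the sets {1,2,3}, {4,6}, {5,7,8,9,10}
```

On your example this matches the pairwise probabilities above: every link with consensus > 0.5 (e.g. 5-7, 7-9) survives, so 5 and 9 end up in the same component via 7 even though the 5-9 link itself is weak. Whether that transitivity is acceptable depends on your data; for millions of points you would want a sparse edge list and a proper union-find or igraph::components() instead of a dense matrix.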
So here's something that might get you closer to your goal: the diceR package implements "consensus clustering". Here's how it would work on your sample. The call below says "do consensus clustering on my three groups of clusters"; p.item=1 means don't resample subsets of the data (or else lots of NA values will show up), reps determines how many times to iterate, and you can also pick one or more clustering algorithms (e.g. hc, pam, diana, km) and one or more distance metrics (e.g. distance=c("euclidean","minkowski")).
Here's a really short example on your data, asking for "three or four clusters" in the final solution and just using hierarchical clustering with euclidean distance (the default):
> library(diceR)
> df <- cbind(unlist(algorithm1),unlist(algorithm2),unlist(algorithm3))
> consensus_cluster(df, nk=3:4, p.item=1, reps=10, algorithms=c("hc"))
, , HC_Euclidean, 3
R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
1 1 1 2 2 3 3 2 1 3 1
2 1 1 2 2 3 3 2 1 3 1
3 1 1 2 2 3 3 2 1 3 1
4 1 1 2 2 3 3 2 1 3 1
5 2 2 3 1 2 1 3 3 1 2
6 2 2 3 1 2 1 3 3 1 2
7 3 3 1 3 1 2 1 2 2 3
8 3 3 1 3 1 2 1 2 2 3
9 3 3 1 3 1 2 1 2 2 3
10 3 3 1 3 1 2 1 2 2 3
, , HC_Euclidean, 4
R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
1 1 4 2 2 3 4 3 1 4 1
2 1 1 2 2 3 4 3 1 4 1
3 1 1 4 2 3 3 2 1 3 1
4 1 1 4 2 3 3 2 1 3 1
5 2 2 3 1 2 1 4 3 1 2
6 2 2 3 1 2 1 4 3 1 2
7 3 3 1 4 1 2 1 2 2 3
8 3 3 1 4 1 2 1 2 2 3
9 4 3 1 3 1 2 1 2 2 4
10 4 3 1 3 4 2 1 4 2 4
The last column shows the cluster assignments on the final iteration. You're going to have to play around with the clustering approach, the distance metric, and the number of reps to find an approach that works for you, but there's a vignette here that will help you do that: https://cran.r-project.org/web/packages/diceR/vignettes/overview.html
There's also an academic paper here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5769335/
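To go from those per-rep assignments to one final partition, the vignette also covers consensus_combine(), which averages the replicate runs into a consensus matrix whose (i, j) entry is the fraction of runs placing points i and j together. A sketch under that assumption (I haven't run this end to end, so check str(cm) for the exact structure):

```r
# Sketch using diceR's consensus_combine() as described in its vignette;
# structure of the returned object should be checked with str().
library(diceR)

algorithm1 <- list(c(1,2,3), c(4,5,6), c(7,8,9,10))
algorithm2 <- list(c(1,2,3), c(4,6), c(5,7,8,9,10))
algorithm3 <- list(c(1,2,3), c(4,6), c(5,7,8), c(9,10))
df <- cbind(unlist(algorithm1), unlist(algorithm2), unlist(algorithm3))

# Same call as the example above
cc <- consensus_cluster(df, nk = 3:4, p.item = 1, reps = 10,
                        algorithms = "hc")

# Collapse the replicate assignments into one consensus matrix per k
cm <- consensus_combine(cc, element = "matrix")

# Treating one such matrix m as a similarity, the final sets come from
# cutting it, e.g.: cutree(hclust(as.dist(1 - m)), k = 3)
```

That last cut is the standard trick in consensus clustering: cluster the consensus matrix itself, so points that co-occur across most reps land in the same final set.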
Hope this is a little helpful :)