
R: Update adjacency matrix/data frame using pairwise combinations


Question


Let's say I have this dataframe:

# mock data set
df.size = 10
cluster.id<- sample(c(1:5), df.size, replace = TRUE)
letters <- sample(LETTERS[1:5], df.size, replace = TRUE)
test.set <- data.frame(cluster.id, letters)

Will be something like:

     cluster.id letters
        <int>  <fctr>
 1          5       A
 2          4       B
 3          4       B
 4          3       A
 5          3       E
 6          3       D
 7          3       C
 8          2       A
 9          2       E
10          1       A

Now I want to group the rows by cluster.id and see which letters occur within each cluster. For example, cluster 3 contains the letters A, E, D and C. For each cluster I then want all unique pairwise combinations of its letters, excluding combinations of a letter with itself (so no A,A): A,E; A,D; A,C; and so on. Finally, I want to update the count for each of these combinations in an adjacency matrix/data frame.

Idea


# group by cluster.id
# per group get all (unique) pairwise combinations for the letters (excluding pairwise combinations with itself, e.g. A,A)
# update the adjacency entry for each pairwise combination

What I tried


# empty adjacency df
possible <- LETTERS
adj.df <- data.frame(matrix(0, ncol = length(possible), nrow = length(possible)))
colnames(adj.df) <- rownames(adj.df) <- possible


# what I tried
update.adj <- function( data ) {
  for (comb in combn(data$letters,2)) {
    # stuck here: how do I update adj.df for each pair?
  }
}

test.set %>% group_by(cluster.id) %>% update.adj(.)

There is probably an easy way to do this, because I see adjacency matrices all the time, but I can't figure it out. Please let me know if anything is unclear.


Answer to comment
In reply to @Manuel Bickel: for the example data (the table under "Will be something like"), the expected result is the matrix below. Keep in mind that for the full dataset the matrix will run from A to Z.

  A B C D E
A 0 0 1 1 2
B 0 0 0 0 0
C 1 0 0 1 1
D 1 0 1 0 1
E 2 0 1 1 0

Here is how I derived it:

    cluster.id letters
        <int>  <fctr>
 1          5       A
 2          4       B
 3          4       B
 4          3       A
 5          3       E
 6          3       D
 7          3       C
 8          2       A
 9          2       E
10          1       A

Only the clusters containing more than one unique letter are relevant, because we don't want combinations of a letter with itself. For example, cluster 4 contains only the letter B, so it would yield only the combination B,B and is therefore not relevant:

 4          3       A
 5          3       E
 6          3       D
 7          3       C
 8          2       A
 9          2       E

Now, for each cluster, I list the pairwise combinations I can make:

cluster 3:

A,E
A,D
A,C
E,D
E,C
D,C

Update these combinations in the adjacency matrix:

  A B C D E
A 0 0 1 1 1
B 0 0 0 0 0
C 1 0 0 1 1
D 1 0 1 0 1
E 1 0 1 1 0

Then move on to the next cluster:

cluster 2:

A,E

Update the adjacency matrix again:

  A B C D E
A 0 0 1 1 2 <-- note the 2 now
B 0 0 0 0 0
C 1 0 0 1 1
D 1 0 1 0 1
E 2 0 1 1 0
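
The walk-through above can be sketched directly as a loop in base R. This is only a minimal sketch: the data is hard-coded from the example table, the matrix is restricted to A-E (use LETTERS for the full set), and the name adj is an assumption made for illustration.

```r
# Example data from the table above (hard-coded so the result is reproducible)
test.set <- data.frame(
  cluster.id = c(5, 4, 4, 3, 3, 3, 3, 2, 2, 1),
  letters    = c("A", "B", "B", "A", "E", "D", "C", "A", "E", "A"),
  stringsAsFactors = FALSE
)

# Empty adjacency matrix; restricted to A-E here, use LETTERS for the full set
possible <- LETTERS[1:5]
adj <- matrix(0, nrow = length(possible), ncol = length(possible),
              dimnames = list(possible, possible))

# Loop over the clusters, keeping only the unique letters in each one
for (lets in split(test.set$letters, test.set$cluster.id)) {
  lets <- unique(lets)
  if (length(lets) < 2) next       # a single letter cannot form a pair
  pairs <- combn(lets, 2)          # 2-row matrix, one pair per column
  for (j in seq_len(ncol(pairs))) {
    a <- pairs[1, j]
    b <- pairs[2, j]
    adj[a, b] <- adj[a, b] + 1     # update both halves so the
    adj[b, a] <- adj[b, a] + 1     # matrix stays symmetric
  }
}

adj
```

For the example data this should reproduce the matrix above, with A,E counted twice. An explicit loop like this is easy to follow but will likely be slow on a large dataset.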

In response to the comments about the huge dataset

library(reshape2)

test.set <- read.table(text = "
                            cluster.id   letters
                       1          5       A
                       2          4       B
                       3          4       B
                       4          3       A
                       5          3       E
                       6          3       D
                       7          3       C
                       8          2       A
                       9          2       E
                       10          1       A", header = T, stringsAsFactors = F)

x1 <- reshape2::dcast(test.set, cluster.id ~ letters)

x1
#cluster.id A B C D E
#1          1 1 0 0 0 0
#2          2 1 0 0 0 1
#3          3 1 0 1 1 1
#4          4 0 2 0 0 0
#5          5 1 0 0 0 0

x2 <- table(test.set)

x2
#          letters
#cluster.id A B C D E
#         1 1 0 0 0 0
#         2 1 0 0 0 1
#         3 1 0 1 1 1
#         4 0 2 0 0 0
#         5 1 0 0 0 0


x1.c <- crossprod(x1)
#Error in crossprod(x, y) : 
#  requires numeric/complex matrix/vector arguments

x2.c <- crossprod(x2)
#works fine
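
The error above most likely arises because dcast() returns a data frame (including the cluster.id column), whereas crossprod() expects a numeric matrix. A possible workaround, sketched here under the assumption that the first column is always the id column, is to drop that column and convert the rest with as.matrix(). The name m is made up for illustration.

```r
library(reshape2)

test.set <- data.frame(
  cluster.id = c(5, 4, 4, 3, 3, 3, 3, 2, 2, 1),
  letters    = c("A", "B", "B", "A", "E", "D", "C", "A", "E", "A"),
  stringsAsFactors = FALSE
)

# dcast() returns a data frame whose first column is cluster.id
x1 <- dcast(test.set, cluster.id ~ letters,
            value.var = "letters", fun.aggregate = length)

# Drop the id column and convert to a matrix before calling crossprod()
m <- as.matrix(x1[, -1])
rownames(m) <- x1$cluster.id

x1.c <- crossprod(m)
diag(x1.c) <- 0    # ignore self-pairs such as A,A
x1.c
```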

Solution

  • Following the comment above, here is Tyler Rinker's code applied to your data. I hope this is what you want.

    UPDATE: Following the comments below, I added a solution using the reshape2 package so that larger amounts of data can be handled.

    test.set <- read.table(text = "
                                cluster.id   letters
                           1          5       A
                           2          4       B
                           3          4       B
                           4          3       A
                           5          3       E
                           6          3       D
                           7          3       C
                           8          2       A
                           9          2       E
                           10          1       A", header = T, stringsAsFactors = F)
    
    x <- table(test.set)
    x
    #          letters
    #cluster.id A B C D E
    #         1 1 0 0 0 0
    #         2 1 0 0 0 1
    #         3 1 0 1 1 1
    #         4 0 2 0 0 0
    #         5 1 0 0 0 0
    
    #base approach, based on answer by Tyler Rinker
    x <- crossprod(x)
    diag(x) <- 0 #this is to set matches such as AA, BB, etc. to zero
    x
    
    #        letters
    # letters A B C D E
    #       A 0 0 1 1 2
    #       B 0 0 0 0 0
    #       C 1 0 0 1 1
    #       D 1 0 1 0 1
    #       E 2 0 1 1 0
    
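    #reshape2 approach
    library(reshape2)
    #fun.aggregate = length makes the cells counts even when no cluster/letter pair repeats
    x <- acast(test.set, cluster.id ~ letters, value.var = "letters", fun.aggregate = length)
    x <- crossprod(x)
    diag(x) <- 0
    x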
    #   A B C D E
    # A 0 0 1 1 2
    # B 0 0 0 0 0
    # C 1 0 0 1 1
    # D 1 0 1 0 1
    # E 2 0 1 1 0
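
One caveat worth checking: crossprod() multiplies counts, so if a letter occurs more than once within a cluster, its pairs are counted more than once. The question asks for unique letters per cluster, so it may be safer to reduce the table to presence/absence first. The sketch below uses a hypothetical variant of the example data (not from the question) in which "A" occurs twice in cluster 2. The names counts and presence are made up for illustration.

```r
# Hypothetical variant of the example: "A" now occurs twice in cluster 2
test.set <- data.frame(
  cluster.id = c(3, 3, 3, 3, 2, 2, 2),
  letters    = c("A", "E", "D", "C", "A", "A", "E"),
  stringsAsFactors = FALSE
)

counts   <- table(test.set)     # cluster x letter counts
presence <- (counts > 0) * 1    # 1 if the letter occurs in the cluster at all

adj.counts   <- crossprod(counts)
adj.presence <- crossprod(presence)
diag(adj.counts)   <- 0
diag(adj.presence) <- 0

adj.counts["A", "E"]     # 3: cluster 3 contributes 1, cluster 2 contributes 2 * 1
adj.presence["A", "E"]   # 2: each cluster contributes at most 1
```

For the original example data both variants give the same result, because the only repeated letter (B in cluster 4) has no partner in its cluster.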