python, r, matrix, cluster-analysis

Forming a symmetric matrix counting instances of being in same cluster


I have a dataset of cities that are assigned to clusters in each year. That is, I ran a modularity-based community detection algorithm separately on each year's data. The final dataset (a mock example) looks like this:

v1 city cluster year
0 "city1"  0  2000 
1 "city2"  2  2000
2 "city3"  1  2000
3 "city4"  0  2000
4 "city5"  2  2000
0 "city1"  2  2001
1 "city2"  1  2001
2 "city3"  0  2001
3 "city4"  0  2001
4 "city5"  0  2001
0 "city1"  1  2002
1 "city2"  2  2002
2 "city3"  0  2002
3 "city4"  0  2002
4 "city5"  1  2002

What I would like to do is count how many times each city ends up in the same cluster as each other city. For the mock example above, I should end up with a 5 x 5 symmetric matrix whose rows and columns are cities, where each entry is the number of years in which cities i and j are in the same cluster (regardless of which cluster it is):


       city1 city2 city3 city4 city5
city1    .     0     0     1     1
city2    0     .     0     0     1
city3    0     0     .     2     1
city4    1     0     2     .     1
city5    1     1     1     1     .

I am working in Python, but a solution in MATLAB or R is fine too.

Thank you


Solution

  • In R, co-occurrence matrices can be computed directly with table and tcrossprod (or crossprod). We can compute one matrix per year and take the sum, like so:

    con <- textConnection('
    v1 city cluster year
    0 "city1" 0 2000 
    1 "city2" 2 2000
    2 "city3" 1 2000
    3 "city4" 0 2000
    4 "city5" 2 2000
    0 "city1" 2 2001
    1 "city2" 1 2001
    2 "city3" 0 2001
    3 "city4" 0 2001
    4 "city5" 0 2001
    0 "city1" 1 2002
    1 "city2" 2 2002
    2 "city3" 0 2002
    3 "city4" 0 2002
    4 "city5" 1 2002
    ')
    d <- read.table(con, header = TRUE)
    close(con)
    
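    # table() builds a city-by-cluster-by-year array of 0/1 indicators.
    # For each year, tcrossprod() turns the city-by-cluster slice into a
    # city-by-city co-membership matrix; Reduce(`+`, ...) sums over years.
    # Note: apply(..., simplify = FALSE) requires R >= 4.1.0.
    x <- with(d, Reduce(`+`, apply(table(city, cluster, year), 3L, tcrossprod, simplify = FALSE)))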
    x
    
           city
    city    city1 city2 city3 city4 city5
      city1     3     0     0     1     1
      city2     0     3     0     0     1
      city3     0     0     3     2     1
      city4     1     0     2     3     1
      city5     1     1     1     1     3
    

    There are threes on the diagonal because cities match themselves every year. If you prefer, say, zeros on the diagonal, then you can add:

    diag(x) <- 0
    

    If you don't like the redundant annotation with "city", then you can add:

    dimnames(x) <- unname(dimnames(x))
    

    And if you want to store the result as a formally symmetric, formally sparse matrix, then you can add:

    library(Matrix)
    x <- as(x, "CsparseMatrix")
    x
    
    5 x 5 sparse Matrix of class "dsCMatrix"
          city1 city2 city3 city4 city5
    city1     .     .     .     1     1
    city2     .     .     .     .     1
    city3     .     .     .     2     1
    city4     1     .     2     .     1
    city5     1     1     1     1     .
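
    Since the question mentions working in Python, here is a minimal standard-library sketch of the same count. It is not a translation of the R code above: instead of multiplying indicator matrices, it groups cities by (year, cluster) and adds one for every pair within a group. The names rows and co_cluster_counts are made up for illustration, and the sketch assumes each city appears at most once per year. The diagonal is left at zero.

```python
from collections import defaultdict
from itertools import combinations

# Mock data from the question, as (city, cluster, year) tuples.
rows = [
    ("city1", 0, 2000), ("city2", 2, 2000), ("city3", 1, 2000),
    ("city4", 0, 2000), ("city5", 2, 2000),
    ("city1", 2, 2001), ("city2", 1, 2001), ("city3", 0, 2001),
    ("city4", 0, 2001), ("city5", 0, 2001),
    ("city1", 1, 2002), ("city2", 2, 2002), ("city3", 0, 2002),
    ("city4", 0, 2002), ("city5", 1, 2002),
]


def co_cluster_counts(rows):
    """Return (cities, matrix), where matrix[i][j] is the number of years
    in which cities[i] and cities[j] share a cluster (diagonal stays 0)."""
    cities = sorted({city for city, _, _ in rows})
    index = {city: i for i, city in enumerate(cities)}

    # Cities with the same (year, cluster) key were clustered together
    # in that year. Assumption: one row per city per year.
    groups = defaultdict(set)
    for city, cluster, year in rows:
        groups[(year, cluster)].add(city)

    n = len(cities)
    matrix = [[0] * n for _ in range(n)]
    for members in groups.values():
        # Every unordered pair inside a group co-occurs once.
        for a, b in combinations(sorted(members), 2):
            i, j = index[a], index[b]
            matrix[i][j] += 1
            matrix[j][i] += 1  # mirror the count to keep the matrix symmetric
    return cities, matrix


cities, matrix = co_cluster_counts(rows)
for city, row in zip(cities, matrix):
    print(city, row)
```

    For the mock data this prints the same off-diagonal counts as the R result, for example 2 for city3 and city4. For many cities with large clusters, the matrix-product approach shown in R is likely the better fit, since the pairwise loop grows quadratically with cluster size.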