Search code examples
rsparse-matrixcontiguous

How to find and name contiguous non-zero entries in a sparse matrix in R?


My problem is conceptually simple. I am looking for a computationally efficient solution of it (my own one I attach at the end).

Suppose we have a potentially very large sparse matrix like the one on the left below and want to 'name' every area of contiguous non-zero elements with a separate code (see matrix on the right)

1 1 1 . . . . .          1 1 1 . . . . .
1 1 1 . 1 1 . .          1 1 1 . 4 4 . .
1 1 1 . 1 1 . .          1 1 1 . 4 4 . .
. . . . 1 1 . .   --->   . . . . 4 4 . .
. . 1 1 . . 1 1          . . 3 3 . . 7 7
1 . 1 1 . . 1 1          2 . 3 3 . . 7 7
1 . . . 1 . . .          2 . . . 5 . . .
1 . . . . 1 1 1          2 . . . . 6 6 6

In my application the contiguous elements would form rectangles, lines or single points and they can only touch each other with the vertexes (i.e. there would be no irregular/non rectangular areas in the matrix).

The solution I imagined is to match the row and column indexes of the sparse matrix representation to a vector with the appropriate values (the 'name' codes). My solution uses several for loops and works fine for small to medium matrices but will fast get stuck in the loops as the dimensions of the matrix become large (>1000). It probably depends from the fact I'm not so advanced in R programming - I couldn't find any computational trick/function to solve it better.

Can anybody suggest a computationally more efficient way to do that in R?

My solution:

mySolution <- function(X){

  if (class(X) != "ngCMatrix") {stop("Input must be a Sparse Matrix")}
  ind <- which(X == TRUE, arr.ind = TRUE)
  r <- ind[,1]
  c <- ind[,2]

  lr <- nrow(ind)
  for (i in 1:lr) {
    if(i == 1) {bk <- 1}
    else {
      if (r[i]-r[i-1] == 1){bk <- c(bk, bk[i-1])}
      else {bk <- c(bk, bk[i-1]+1)}
    }
  }

  for (LOOP in 1:(lr-1)) {
    tr <- r[LOOP]
    tc <- c[LOOP]
    for (j in (LOOP+1):lr){
      if (r[j] == tr) {
        if(c[j] == tc + 1) {bk[j] <- bk[LOOP]} 
      }
    }
  }

  val <- unique(bk)
  for (k in 1:lr){
    bk[k] <- which(val==bk[k])
  }

  return(sparseMatrix(i = r, j = c, x = bk))
}

Thanks in advance for any help or pointer.


Solution

  • Relying heavily on the fact that all neighbouring elements to be grouped form only rectangles/lines/points, we see that elements of the matrix can be aggregated based on their [row, col] indices on the matrix by the relation (abs(row1 - row2) + abs(col1 - col2)) < 2.

    So, starting with the [row, col] indices:

    sm = as.matrix(summary(m))
    

    We compute their distance, which -as noted by GiuGe- is actually the "manhattan" method:

    d = dist(sm, "manhattan")
    

    Single-linkage's property in clustering elements on their nearest neighbour is useful here. Also, we can get a grouping of the elements by cutreeing on "h = 1" (where the distance of indices is "< 2"):

    gr = cutree(hclust(d, "single"), h = 1)
    

    Finally, we can wrap the above in a new sparse matrix:

    sparseMatrix(i = sm[, "i"], j = sm[, "j"], x = gr)
    #8 x 8 sparse Matrix of class "dgCMatrix"
    #                    
    #[1,] 1 1 1 . . . . .
    #[2,] 1 1 1 . 4 4 . .
    #[3,] 1 1 1 . 4 4 . .
    #[4,] . . . . 4 4 . .
    #[5,] . . 3 3 . . 7 7
    #[6,] 2 . 3 3 . . 7 7
    #[7,] 2 . . . 5 . . .
    #[8,] 2 . . . . 6 6 6
    

    "m" used is:

    library(Matrix)
    m = new("ngCMatrix"
        , i = c(0L, 1L, 2L, 5L, 6L, 7L, 0L, 1L, 2L, 0L, 1L, 2L, 4L, 5L, 4L, 
    5L, 1L, 2L, 3L, 6L, 1L, 2L, 3L, 7L, 4L, 5L, 7L, 4L, 5L, 7L)
        , p = c(0L, 6L, 9L, 14L, 16L, 20L, 24L, 27L, 30L)
        , Dim = c(8L, 8L)
        , Dimnames = list(NULL, NULL)
        , factors = list()
    )
    

    EDIT Feb 10 '17

    Another idea (and, again, considering the fact that neighbouring elements form only rectangles/lines/points) is to iterate -in ascending columns- through the [row, col] indices and, in each step, find the distance of each element of its nearest neighbour in current column and row. If a "< 2" distance is found then the element is grouped with its neighbour, else a new group is started. Wrapped into a function:

    ff = function(x) 
    {
        sm = as.matrix(summary(x))
    
        gr = integer(nrow(sm)); ngr = 0L ; gr[1] = ngr 
    
        lastSeenRow = integer(nrow(x))
        lastSeenCol = integer(ncol(x))
    
        for(k in 1:nrow(sm)) {
            kr = sm[k, 1]; kc = sm[k, 2]
            i = lastSeenRow[kr]
            j = lastSeenCol[kc]
    
            if(i && (abs(kc - sm[i, 2]) == 1)) gr[k] = gr[i]
            else if(j && (abs(kr - sm[j, 1]) == 1)) gr[k] = gr[j]  
                 else { ngr = ngr + 1L; gr[k] = ngr } 
    
            lastSeenRow[kr] = k
            lastSeenCol[kc] = k        
        }
    
        sparseMatrix(i = sm[, "i"], j = sm[, "j"], x = gr)                 
    }                  
    

    And applied on "m":

    ff(m)
    #8 x 8 sparse Matrix of class "dgCMatrix"
    #                    
    #[1,] 1 1 1 . . . . .
    #[2,] 1 1 1 . 4 4 . .
    #[3,] 1 1 1 . 4 4 . .
    #[4,] . . . . 4 4 . .
    #[5,] . . 3 3 . . 7 7
    #[6,] 2 . 3 3 . . 7 7
    #[7,] 2 . . . 5 . . .
    #[8,] 2 . . . . 6 6 6
    

    Also, it's convenient that both functions return groups in the same order, as we can check:

    identical(mySolution(m), ff(m))
    #[1] TRUE
    

    On a, seemingly, more complex example:

    mm = new("ngCMatrix"
        , i = c(25L, 26L, 27L, 25L, 29L, 25L, 25L, 17L, 18L, 26L, 3L, 4L, 5L, 
    14L, 17L, 18L, 25L, 27L, 3L, 4L, 5L, 17L, 18L, 23L, 26L, 3L, 
    4L, 5L, 10L, 17L, 18L, 9L, 11L, 17L, 18L, 10L, 17L, 18L, 3L, 
    17L, 18L, 21L, 17L, 18L, 17L, 18L, 1L, 2L, 3L, 4L, 16L, 8L, 17L, 
    18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 7L, 9L, 10L, 11L, 26L, 
    8L, 27L, 1L, 2L, 28L, 1L, 2L, 15L, 27L, 1L, 2L, 21L, 22L, 1L, 
    2L, 7L, 21L, 22L, 1L, 2L, 6L, 24L, 1L, 2L, 5L, 11L, 16L, 25L, 
    26L, 27L, 4L, 15L, 17L, 19L, 25L, 26L, 27L, 3L, 16L, 25L, 26L, 
    27L, 2L, 28L, 1L)
        , p = c(0L, 0L, 3L, 3L, 5L, 6L, 7L, 7L, 10L, 18L, 25L, 31L, 35L, 38L, 
    42L, 44L, 46L, 51L, 61L, 66L, 68L, 71L, 75L, 79L, 84L, 88L, 96L, 
    103L, 108L, 110L, 111L)
        , Dim = c(30L, 30L)
        , Dimnames = list(NULL, NULL)
        , factors = list()
    )
    identical(mySolution(mm), ff(mm))
    #[1] TRUE
    

    And a simple benchmark on a larger matrix:

    times = 30 # times `dim(mm)`
    MM2 = do.call(cbind, rep_len(list(do.call(rbind, rep_len(list(mm), times))), times))
    dim(MM2)
    #[1] 900 900
    
    system.time({ ans1 = mySolution(MM2) })
    #   user  system elapsed 
    # 449.50    0.53  463.26
    
    system.time({ ans2 = ff(MM2) })
    #   user  system elapsed 
    #   0.51    0.00    0.52
    
    identical(ans1, ans2)
    #[1] TRUE