Tags: numpy, scipy, sparse-matrix

50Kx50K sparse matrix


I need to hold a 50,000x50,000 sparse matrix/2d-array, with ~5% of the cells, uniformly distributed, being non-empty. I will need to:

Edit: I need to do this in numpy/scipy, sorry if that wasn't clear. I've also added requirements below.

  1. Read the 5% non-empty data from a DB, and assign it to matrix/2d-array cells, as quickly as possible.
  2. Use as little memory as possible.
  3. Use fancy indexing (e.g., take the indexes and values of all non-empty entries in a given column). This is nice-to-have; memory and construction time are more important.
  4. Once constructed, the matrix will not change.
  5. I will, however, want to take its transpose, preferably in O(1) time and memory.

What's the most efficient way of achieving this? Can I store NaNs instead of zeros to mark "empty" cells (0 is a valid value for me), and can I efficiently run nansum and nanmean on them? If not, can I efficiently get the indexes and values of all non-zeros in a given column/row?
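For the extraction part, here is a sketch of what scipy.sparse offers, using made-up toy triplets (the variable names and values are purely illustrative). Stored entries of a column can be read straight from the CSC arrays; unlike `nonzero()`, this keeps explicitly stored zeros distinguishable, and the transpose reinterprets the same three arrays as CSR without copying:

```python
import numpy as np
import scipy.sparse as sp

# Toy triplets standing in for rows fetched from a DB (illustrative values).
rows = np.array([0, 2, 2, 4])
cols = np.array([1, 1, 3, 0])
vals = np.array([1.5, 0.0, 2.5, 3.0])  # the explicit 0.0 stays a *stored* entry

# COO is cheap to fill from triplets; convert once to CSC for column access.
m = sp.coo_matrix((vals, (rows, cols)), shape=(5, 5)).tocsc()

# Stored entries of column j, read straight from the CSC arrays:
# O(nnz in column), and explicit zeros are kept (nonzero() would drop them).
j = 1
start, end = m.indptr[j], m.indptr[j + 1]
col_rows = m.indices[start:end]   # row indexes of stored entries in column j
col_vals = m.data[start:end]      # their values

# Transpose without copying: the CSC arrays are reread as CSR.
mt = m.T
```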


Solution

  • Well, for my purposes it seems like CSC is the way to go. With a 5% "sparsity factor", the memory that the row indexes take in CSC is still worth it. Here's the code I used to check that the operations I need really are fast:

    import random

    import numpy as np
    import scipy.sparse as sp


    def build_csc(N, SPARSITY_FACTOR):
        data = []
        row_indexes = []
        column_indexes = [0] * (N + 1)

        current_index = 0
        for j in range(N):  # range, not Python 2's xrange
            column_indexes[j] = current_index
            for i in range(N):
                if random.random() < SPARSITY_FACTOR:
                    row_indexes.append(i)
                    data.append(random.random())
                    current_index += 1
        column_indexes[N] = current_index

        # np.float64 instead of the removed np.float alias
        return sp.csc_matrix((data, row_indexes, column_indexes),
                             shape=(N, N), dtype=np.float64)


    def take_from_col(m, col_index):
        col = m[:, col_index]                    # sparse (N, 1) column
        indexes = col.nonzero()[0]               # row indexes of non-zeros
        values = col[indexes].toarray().ravel()  # their values, densified
        return indexes, values

    

    Running this in %timeit shows that this is indeed fast.
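As a rough sanity check on the memory requirement, the CSC footprint for the 50,000x50,000 case can be estimated from the three arrays it stores. This is back-of-the-envelope arithmetic I'm adding here, assuming float64 values and 32-bit indexes (scipy's default while nnz stays below 2**31):

```python
# Rough memory estimate for a 50,000 x 50,000 CSC matrix at ~5% density.
N, density = 50_000, 0.05
nnz = int(N * N * density)              # ~125 million stored entries

data_bytes = nnz * 8                    # float64 values
indices_bytes = nnz * 4                 # int32 row indexes
indptr_bytes = (N + 1) * 4              # int32 column pointers
csc_bytes = data_bytes + indices_bytes + indptr_bytes

dense_bytes = N * N * 8                 # full float64 array, for comparison

print(round(csc_bytes / 2**30, 1))      # ~1.4 GiB
print(round(dense_bytes / 2**30, 1))    # ~18.6 GiB
```

So at this density CSC needs roughly an order of magnitude less memory than a dense array, at the cost of the extra index arrays.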