I need to hold a 50,000x50,000 sparse matrix/2d-array, with ~5% of the cells, uniformly distributed, being non-empty. I will need to:
edit: I need to do this in numpy/scipy; sorry if that wasn't clear. I've also added requirements.
What's the most efficient way of achieving this? Can I hold NaNs instead of zeros to indicate "empty" cells (0 is a valid value for me), and can I efficiently run nansum and nanmean? If not, can I efficiently get the indices and values of all non-zeros in a given column/row?
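As a small sketch of the trade-off being asked about (assuming scipy.sparse, per the edit above): sparse formats store only the nonzero entries, so a NaN-as-missing scheme doesn't fit them naturally, whereas pulling the nonzero indices and values out of a single column is cheap:

```python
import numpy as np
import scipy.sparse as sp

# Tiny illustrative matrix; sparse formats treat unstored cells as 0,
# so sum/mean over an axis include those implicit zeros.
m = sp.csc_matrix(np.array([[0.0, 1.5],
                            [2.0, 0.0]]))

col = m.getcol(1)        # sparse (2, 1) column slice
rows = col.nonzero()[0]  # row indices of stored nonzeros -> [0]
vals = col.data          # their values                   -> [1.5]

print(rows, vals)
print(m.sum(axis=0))     # column sums, implicit zeros included
```

Note that `nansum`/`nanmean` would require densifying the matrix (prohibitive at 50,000 x 50,000), which is why the per-column nonzero extraction below is the sparse-friendly route.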
Well, for my purposes it seems like CSC is the way to go. With a 5% sparsity factor, the memory that the row indices in CSC take is still worth it. Here's the code I used to test that the operations I need really are fast:
import random

import numpy as np
import scipy.sparse as sp

def build_csc(N, SPARSITY_FACTOR):
    """Build an N x N CSC matrix with ~SPARSITY_FACTOR of cells filled."""
    data = []
    row_indexes = []
    column_indexes = [0] * (N + 1)
    current_index = 0
    for j in range(N):  # xrange in Python 2
        column_indexes[j] = current_index
        for i in range(N):
            if random.random() < SPARSITY_FACTOR:
                row_indexes.append(i)
                data.append(random.random())
                current_index += 1
    column_indexes[N] = current_index
    # np.float64 instead of the removed np.float alias
    return sp.csc_matrix((data, row_indexes, column_indexes),
                         shape=(N, N), dtype=np.float64)
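As an aside, a vectorized alternative to the Python-level loops above (assuming a reasonably recent SciPy) is `scipy.sparse.random`, which builds a uniformly distributed random sparse matrix directly in the desired format:

```python
import numpy as np
import scipy.sparse as sp

# One-liner equivalent of build_csc for test data: values are
# uniform in [0, 1) by default, placed at ~density of the cells.
m = sp.random(1000, 1000, density=0.05, format='csc', dtype=np.float64)
```

This is much faster to construct than the double loop, though the loop version makes the CSC index arrays (`data`, `row_indexes`, `column_indexes`) explicit, which is useful for understanding the format.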
def take_from_col(m, col_index):
    """Return the row indices and values of the nonzeros in one column."""
    col = m[:, col_index]
    indexes = col.nonzero()[0]
    values = col[indexes]
    return indexes, values
Running these with %timeit shows that this is indeed fast.
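For anyone outside IPython, here is a self-contained stdlib version of that timing check (sizes scaled down from 50,000 so it runs quickly; the column extraction mirrors `take_from_col` above):

```python
import timeit

import numpy as np
import scipy.sparse as sp

# Build a random 5%-dense CSC matrix, then time extracting the
# nonzero row indices and values of a single column.
N = 5000
m = sp.random(N, N, density=0.05, format='csc', dtype=np.float64)

def take_from_col(col_index):
    col = m.getcol(col_index)          # sparse (N, 1) column slice
    return col.nonzero()[0], col.data  # row indices, values

t = timeit.timeit(lambda: take_from_col(123), number=1000)
print(f"{t / 1000 * 1e6:.1f} us per column extraction")
```

Because CSC stores each column's entries contiguously, this stays fast regardless of N; the analogous row extraction would want CSR instead.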