Tags: numpy, scipy, sparse-matrix

50Kx50K sparse matrix


I need to hold a 50,000x50,000 sparse matrix/2d-array, with ~5% of the cells, uniformly distributed, being non-empty. I will need to:

Edit: I need to do this in numpy/scipy, sorry if that wasn't clear. I've also added requirements below.

  1. Read the 5% non-empty data from a DB, and assign it to matrix/2d-array cells, as quickly as possible.
  2. Use as little memory as possible.
  3. Use fancy indexing (e.g., take the indexes and values of all non-empty entries in a given column). This is nice-to-have; memory and construction time are more important.
  4. Once constructed, the matrix will not change.
  5. I will, however, want to take its transpose, preferably in O(1) time and memory.

What's the most efficient way of achieving this? Can I store NaNs instead of zeros to mark "empty" cells (0 is a valid value for me), and can I efficiently run nansum and nanmean on them? If not, can I efficiently get the indexes and values of all non-zeros in a given column/row?
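For the extraction part, here is a sketch of what scipy.sparse offers, using made-up toy triplets (the variable names and values are purely illustrative). Stored entries of a column can be read straight from the CSC arrays; unlike `nonzero()`, this keeps explicitly stored zeros distinguishable, and the transpose reinterprets the same three arrays as CSR without copying:

```python
import numpy as np
import scipy.sparse as sp

# Toy triplets standing in for rows fetched from a DB (illustrative values).
rows = np.array([0, 2, 2, 4])
cols = np.array([1, 1, 3, 0])
vals = np.array([1.5, 0.0, 2.5, 3.0])  # the explicit 0.0 stays a *stored* entry

# COO is cheap to fill from triplets; convert once to CSC for column access.
m = sp.coo_matrix((vals, (rows, cols)), shape=(5, 5)).tocsc()

# Stored entries of column j, read straight from the CSC arrays:
# O(nnz in column), and explicit zeros are kept (nonzero() would drop them).
j = 1
start, end = m.indptr[j], m.indptr[j + 1]
col_rows = m.indices[start:end]   # row indexes of stored entries in column j
col_vals = m.data[start:end]      # their values

# Transpose without copying: the CSC arrays are reread as CSR.
mt = m.T
```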


Solution

  • Well, for my purposes it seems like CSC is the way to go. With a 5% "sparsity factor", the memory that the row indexes take in CSC is still worth it. Here's the code I used to check that the operations I need really are fast:

    import random

    import numpy as np
    import scipy.sparse as sp


    def build_csc(N, SPARSITY_FACTOR):
        data = []
        row_indexes = []
        column_indexes = [0] * (N + 1)

        current_index = 0
        for j in range(N):  # range, not Python 2's xrange
            column_indexes[j] = current_index
            for i in range(N):
                if random.random() < SPARSITY_FACTOR:
                    row_indexes.append(i)
                    data.append(random.random())
                    current_index += 1
        column_indexes[N] = current_index

        # np.float64 instead of the removed np.float alias
        return sp.csc_matrix((data, row_indexes, column_indexes),
                             shape=(N, N), dtype=np.float64)


    def take_from_col(m, col_index):
        col = m[:, col_index]                    # sparse (N, 1) column
        indexes = col.nonzero()[0]               # row indexes of non-zeros
        values = col[indexes].toarray().ravel()  # their values, densified
        return indexes, values

    

    Running this in %timeit shows that this is indeed fast.
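As a rough sanity check on the memory requirement, the CSC footprint for the 50,000x50,000 case can be estimated from the three arrays it stores. This is back-of-the-envelope arithmetic I'm adding here, assuming float64 values and 32-bit indexes (scipy's default while nnz stays below 2**31):

```python
# Rough memory estimate for a 50,000 x 50,000 CSC matrix at ~5% density.
N, density = 50_000, 0.05
nnz = int(N * N * density)              # ~125 million stored entries

data_bytes = nnz * 8                    # float64 values
indices_bytes = nnz * 4                 # int32 row indexes
indptr_bytes = (N + 1) * 4              # int32 column pointers
csc_bytes = data_bytes + indices_bytes + indptr_bytes

dense_bytes = N * N * 8                 # full float64 array, for comparison

print(round(csc_bytes / 2**30, 1))      # ~1.4 GiB
print(round(dense_bytes / 2**30, 1))    # ~18.6 GiB
```

So at this density CSC needs roughly an order of magnitude less memory than a dense array, at the cost of the extra index arrays.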