python r sparse-matrix large-data bigdata

Accessing a large number of unsorted array elements in Python

I'm not very skilled in Python. However, I'm pretty handy with R. Yet, I do have to use Python since it has an up-to-date interface with Cplex. I'm also trying to avoid all the extra coding I would have to do in C/C++

That being said, I have issues with speed and efficiency on large lists/arrays/matrices...

I quickly wrote two lines in R, which are pretty ugly but work somewhat well...

doses = sapply(organSets[[2]], function(x) sum(voxMap_beamlet_val[which(voxMap_beamlet_iInd[,1] == x),1]))
length(which(doses <= sum(doses)/length(organSets[[2]])))/length(doses)

... these lines execute in about 5 minutes for length(organSets[[2]])=52960 and length(voxMap_beamlet_val) = length(voxMap_beamlet_iInd) = 1217077

However, in Python, something similar takes about two hours. By using scipy.sparse, the execution time gets reduced by half, to about an hour... which is still obviously unacceptable...

Python code as follows...

voxMapBeamlet = sparse.coo_matrix((voxMap_beamlet_val,(voxMap_beamlet_iInd,voxMap_beamlet_jInd)),shape=(1055736,8500))
def getDose(voxInd):
    return sum([x.sum() for x in voxMapBeamlet.getrow(voxInd)])

def probDoseLevel2(orgVox,level=None):
    voxDose2 = [0] * len(orgVox)
    for i,v in enumerate(orgVox):
        voxDose2[i] = getDose(v)
    if level == None:
        level = sum(voxDose2)/len(orgVox)
    print len([x for x in voxDose2 if x <= level])/len(orgVox)

probDoseLevel2(organSets[1])

Please advise.

Solution

First, you want efficient access to rows, and COO format doesn't do that. Convert your voxMapBeamlet to Compressed Sparse Row format, and getrow becomes much more efficient:

voxMapBeamlet = sparse.coo_matrix((voxMap_beamlet_val,(voxMap_beamlet_iInd,voxMap_beamlet_jInd)),shape=(1055736,8500))
voxMapBeamlet = voxMapBeamlet.tocsr()

Second, getDose is a lot more complicated than it needs to be:

def getDose(voxInd):
    return voxMapBeamlet.getrow(voxInd).sum()

I suspect this will already be fast enough, but we should be able to push it further. We can push more of the work out of Python bytecode and into C-level routines using broadcasting and advanced indexing:

def probDoseLevel2(orgVox,level=None):
    relevantRows = voxMapBeamlet[orgVox]

    # Dense column matrix of row sums
    voxDose2 = relevantRows.sum(axis=1)
    if level is None:
        level = voxDose2.sum()/len(orgVox)
    return (voxDose2 <= level).sum() / len(orgVox)

Finally, if you're on Python 2, the division in your last line is integer division. That'll always produce 0 or 1 here, since len(orgVox) is at least as big as the numerator. There may be a similar problem with the division to produce level, depending on the dtype of the data you're working with. To turn on true division, you can put

from __future__ import division

at the top of your file.