Identify which rows (or columns) have values in sparse Matrix

I need to identify the rows (/columns) that have defined values in a large sparse Boolean Matrix. I want to use this to 1. slice (actually view) the Matrix by those rows/columns; and 2. slice (/view) vectors and matrices that have the same dimensions as the margins of a Matrix. I.e. the result should probably be a Vector of indices / Bools or (preferably) an iterator.

I've tried the obvious:

a = sprand(10000, 10000, 0.01)
cols = unique(a.colptr)
rows = unique(a.rowvals)

but each of these take like 20ms on my machine, probably because they allocate about 1MB (at least they allocate cols and rows). This is inside a performance-critical function, so I'd like the code to be optimized. The Base code seems to have an nzrange iterator for sparse matrices, but it is not easy for me to see how to apply that to my case.

Is there a suggested way of doing this?

Second question: I'd need to also perform this operation on views of my sparse Matrix - would that be something like x = view(a,:,:); cols = unique(x.parent.colptr[x.indices[:,2]]) or is there specialized functionality for this? Views of sparse matrices appear to be tricky (cf https://discourse.julialang.org/t/slow-arithmetic-on-views-of-sparse-matrices/3644 – not a cross-post)

Thanks a lot!

Solution

Regarding getting the non-zero rows and columns of a sparse matrix, the following functions should be pretty efficient:

nzcols(a::SparseMatrixCSC) = collect(i 
  for i in 1:a.n if a.colptr[i]<a.colptr[i+1])

function nzrows(a::SparseMatrixCSC)
    active = falses(a.m)
    for r in a.rowval
        active[r] = true
    end
    return find(active)
end

For a 10_000x10_000 matrix with 0.1 density it takes 0.2ms and 2.9ms for cols and rows, respectively. It should also be quicker than method in question (apart from the correctness issue as well).

Regarding views of sparse matrices, a quick solution would be to turn view into a sparse matrix (e.g. using b = sparse(view(a,100:199,100:199))) and use functions above. In code:

nzcols(b::SubArray{T,2,P}) where {T,P<:AbstractSparseArray} = nzcols(sparse(b))
nzrows(b::SubArray{T,2,P}) where {T,P<:AbstractSparseArray} = nzrows(sparse(b))

A better solution would be to customize the functions according to view. For example, when the view uses UnitRanges for both rows and columns:

# utility predicate returning true if element of sorted v in range r
inrange(v,r) = searchsortedlast(v,last(r))>=searchsortedfirst(v,first(r))

function nzcols(b::SubArray{T,2,P,Tuple{UnitRange{Int64},UnitRange{Int64}}}
  ) where {T,P<:SparseMatrixCSC}
    return collect(i+1-start(b.indexes[2]) 
      for i in b.indexes[2]
      if b.parent.colptr[i]<b.parent.colptr[i+1] && 
        inrange(b.parent.rowval[nzrange(b.parent,i)],b.indexes[1]))
end

function nzrows(b::SubArray{T,2,P,Tuple{UnitRange{Int64},UnitRange{Int64}}}
  ) where {T,P<:SparseMatrixCSC}
    active = falses(length(b.indexes[1]))
    for c in b.indexes[2]
        for r in nzrange(b.parent,c)
            if b.parent.rowval[r] in b.indexes[1]
                active[b.parent.rowval[r]+1-start(b.indexes[1])] = true
            end
        end
    end
    return find(active)
end

which work faster than the versions for the full matrices (for 100x100 submatrix of above 10,000x10,000 matrix cols and rows take 16μs and 12μs, respectively on my machine, but these are unstable results).

A proper benchmark would use fixed matrices (or at least fix the random seed). I'll edit this line with such a benchmark if I do it.