Tags: memory, octave, sparse-matrix, variance

Octave: std on sparse matrix too memory intensive


I have a very large sparse matrix in Octave and I want to get the variance of each row. When I call std(A,1), Octave crashes because memory is exhausted. Why does this happen? The variance should be very easy to calculate for a sparse matrix, shouldn't it? How can I make this work?


Solution

  • If you want the standard deviation of just the nonzero entries in each column, you can compute it without ever forming a dense matrix (for per-row statistics, run the same code on A.'):

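    [nrows, ncols] = size(A);
    
    % number of nonzero entries in each column
    counts = sum(spones(A),1);
    
    % mean of the nonzero entries; max(counts,1) avoids 0/0 in all-zero columns
    means = sum(A,1) ./ max(counts, 1);
    
    % sparse matrix holding each column's mean at the nonzero positions of A
    [i,j] = find(A);
    v = means(j);
    placedmeans = sparse(i,j,v,nrows,ncols);
    
    % A - placedmeans is still sparse, so the deviations never become dense
    vars = sum((A - placedmeans).^2, 1) ./ max(counts, 1);
    
    stds = sqrt(vars);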
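
    As a quick sanity check, the nonzero-only computation can be compared against a plain per-column loop on a matrix small enough to inspect by hand. This is only a sketch: the test matrix is made up for illustration, and the full() calls are an extra precaution so that the comparison is done on ordinary dense row vectors.

    ```octave
    % Hypothetical test matrix; column 3 is entirely zero.
    A = sparse([1 0 0; 3 0 0; 0 5 0; 5 0 0]);
    [nrows, ncols] = size(A);

    % Nonzero-only standard deviation, computed without densifying A.
    counts = full(sum(spones(A), 1));
    means = full(sum(A, 1)) ./ max(counts, 1);
    [i, j] = find(A);
    mu = means(j);
    placedmeans = sparse(i, j, mu(:), nrows, ncols);
    stds = full(sqrt(sum((A - placedmeans).^2, 1) ./ max(counts, 1)));

    % Reference: std of the nonzero entries of each column, normalized by N.
    expected = zeros(1, ncols);
    for k = 1:ncols
      col = full(nonzeros(A(:, k)));
      if ~isempty(col)
        expected(k) = std(col, 1);
      end
    end
    assert(stds, expected, 1e-12);
    ```

    The loop is only practical for small inputs; its purpose is to confirm that all-zero columns come out as 0 rather than NaN.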
    

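    I can't think of many situations where you would want the standard deviation of all the terms in each column of a sparse matrix (including zeros), but that is what std(A,1) computes. Subtracting the mean directly from A would turn every zero into a nonzero and make the matrix dense, which is presumably why std runs out of memory. You can avoid that: each zero entry contributes (0 - mean)^2 = mean^2 to the sum of squared deviations, so you only need to count the zeros in each column and include them in the calculation: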

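    [nrows,ncols] = size(A);
    
    % number of zero entries in each column
    zerocounts = nrows - sum(spones(A),1);
    
    % mean over all entries of each column, zeros included
    means = sum(A,1) ./ nrows;
    
    % sparse matrix holding each column's mean at the nonzero positions of A
    [i,j] = find(A);
    v = means(j);
    placedmeans = sparse(i,j,v,nrows,ncols);
    
    % nonzero entries contribute (a - mean)^2; each zero contributes mean^2
    vars = (sum((A - placedmeans).^2, 1) + zerocounts .* means.^2) ./ nrows;
    
    stds = sqrt(vars);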
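
    When the zeros are included, a shorter route (not the method used above, but a standard identity) is var = mean(x.^2) - mean(x)^2, which needs only two sparse column sums. The sketch below assumes a small made-up test matrix so the result can be checked against the dense std; the max(..., 0) clamp is an added guard against tiny negative values caused by rounding.

    ```octave
    % Hypothetical test matrix, small enough to densify for the comparison.
    A = sparse([1 0 0; 3 0 0; 0 5 0; 5 0 0]);
    nrows = size(A, 1);

    means = full(sum(A, 1)) ./ nrows;       % column means, zeros included
    meansq = full(sum(A.^2, 1)) ./ nrows;   % column means of the squared entries
    vars = max(meansq - means.^2, 0);       % clamp small negative rounding errors
    stds = sqrt(vars);

    % Dense reference, normalized by N like std(A,1).
    assert(stds, std(full(A), 1), 1e-12);
    ```

    This form can lose precision through cancellation when a column's mean is large compared to its spread, in which case the placedmeans version is the safer choice.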
    

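    Also, both versions divide by N (counts and nrows respectively), which matches std(A,1). I don't know if you want to subtract one from the denominator instead; that would give the sample standard deviation that std(A) returns by default, but you would then have to guard against a denominator of zero.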
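
    For the zeros-included case, subtracting one from the denominator might look like the sketch below. The test matrix is made up for illustration, ssd is a hypothetical variable name, and max(nrows - 1, 1) is an added guard for a single-row matrix.

    ```octave
    % Hypothetical test matrix, small enough to densify for the comparison.
    A = sparse([1 0 0; 3 0 0; 0 5 0; 5 0 0]);
    [nrows, ncols] = size(A);

    zerocounts = nrows - full(sum(spones(A), 1));
    means = full(sum(A, 1)) ./ nrows;
    [i, j] = find(A);
    mu = means(j);
    placedmeans = sparse(i, j, mu(:), nrows, ncols);

    % Same sum of squared deviations as before, now divided by nrows - 1.
    ssd = full(sum((A - placedmeans).^2, 1)) + zerocounts .* means.^2;
    stds = sqrt(ssd ./ max(nrows - 1, 1));

    % Dense reference: std with its default (N - 1) normalization.
    assert(stds, std(full(A)), 1e-12);
    ```

    For the nonzero-only case the analogous change would be to divide by max(counts - 1, 1), which again leaves all-zero and single-entry columns at zero.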

    EDIT: Fixed a bug that built the placedmeans matrix with the wrong size whenever A ended in a row or column of all zeros. Also, the first case now returns a mean/var/std of zero whenever a column is all zeros (previously it would have been NaN).