
Joint Entropy Performance in Julia


I wrote a function to calculate the joint entropy of each column pair in a matrix, but I would like to reduce both its running time and its memory use.

The function looks like this:

function jointentropy(aln)
  mat = Array(Float64,size(aln,2),size(aln,2))

  for i in combinations(1:size(aln,2),2)
    a = i[1]
    b = i[2]
    mina, maxa = extrema(aln[:,a])
    minb, maxb = extrema(aln[:,b])
    h = Array(Float64,(maxa-mina+1,maxb-minb+1))
    h = hist2d([aln[:,a] aln[:,b]],mina-1:1:maxa,minb-1:1:maxb)[3]
    h = h/size(aln[:,1],1)
    I,J,V = findnz(h)
    l = sparse(I,J,log2(V),maxa-mina+1,maxb-minb+1)
    mat[b,a] = - sum(l.*h)
  end
  return mat
end

Matrices that go into this function look like this:

rand(45:122,rand(1:2000),rand(1:2000))

An example with a 500x500 matrix resulted in the following @time output:

elapsed time: 33.692081413 seconds (33938843192 bytes allocated, 36.42% gc time)

...which, at roughly 34 GB allocated, seems like a whole lot of memory...

Any suggestions on how to speed up this function and reduce memory allocation?

Thanks in advance for any help!


Solution

  • Here are a few ideas to speed up your function.

    • If the range of all the columns is roughly the same, you can move the extrema computations outside the loop and reuse the same h array.

    • hist2d creates a new array: you can use hist2d! to reuse the previous one.

    • The assignment h = h/size(aln[:,1],1) creates a new array; note also that size(aln[:,1],1) copies a whole column just to get the row count, which size(aln,1) returns without allocating.

    • The division in h = h/size(aln[:,1],1) is applied to every element of the array, including the zeros.

    • You can use a loop instead of findnz and a sparse matrix (findnz already contains a loop).


    function jointentropy2(aln)
      n1 = size(aln,1)
      n2 = size(aln,2)
      mat = Array(Float64,n2,n2)

      # compute the extrema once for the whole matrix and
      # preallocate a single histogram that every pair reuses
      lower, upper = extrema(aln)
      m = upper-lower+1
      h = Array(Float64,(m,m))
      for a in 1:n2
        for b in (a+1):n2
          # hist2d! fills h in place instead of allocating a new array
          Base.hist2d!(h,[aln[:,a] aln[:,b]],lower-1:1:upper,lower-1:1:upper)
          s = 0.0  # Float64 literal keeps the accumulator type-stable
          for i in 1:m
            for j in 1:m
              if h[i,j] != 0
                p = h[i,j] / n1  # normalize only the nonzero bins
                s += p * log2(p)
              end
            end
          end
          mat[b,a] = -s
        end
      end
      return mat
    end
    

    This is twice as fast as the initial function, and the memory allocation is divided by about 4.

    aln = rand(45:122,500,400)
    
    @time x = jointentropy(aln)
    # elapsed time: 26.946314168 seconds (21697858752 bytes allocated, 29.97% gc time)
    
    @time y = jointentropy2(aln)
    # elapsed time: 13.626282821 seconds (5087119968 bytes allocated, 16.21% gc time)
    
    x - y  # approximately zero (at least below the diagonal -- 
           # the matrix was not initialized above it)
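
    To sanity-check only the part that is initialized, you can compare the strict lower triangles, for instance with Base's tril (a quick check, assuming the old vectorized abs):

    maximum(abs(tril(x - y, -1)))  # approximately zero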
    

    The next candidate for optimization is hist2d: here, you could replace it with a hand-written counting loop (and possibly a sparse matrix for the counts), as sketched after the profiling snippet below.

    @profile jointentropy2(aln)
    Profile.print()
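
    Here is a minimal sketch of that idea, replacing Base.hist2d! with a direct counting loop over the rows. jointentropy3 is a hypothetical name, it keeps the dense h rather than switching to a sparse matrix, and it is untimed:

    function jointentropy3(aln)
      n1, n2 = size(aln)
      mat = Array(Float64,n2,n2)
      lower, upper = extrema(aln)
      m = upper-lower+1
      h = zeros(Float64,m,m)
      for a in 1:n2
        for b in (a+1):n2
          fill!(h, 0.0)
          # count the (a,b) symbol pairs directly, avoiding the
          # temporary [aln[:,a] aln[:,b]] matrix that hist2d! needs
          for k in 1:n1
            h[aln[k,a]-lower+1, aln[k,b]-lower+1] += 1.0
          end
          s = 0.0
          for j in 1:m, i in 1:m  # column-major traversal
            if h[i,j] != 0
              p = h[i,j] / n1
              s += p * log2(p)
            end
          end
          mat[b,a] = -s
        end
      end
      return mat
    end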