Search code examples
pythonarraysdata-structuressparse-matrix

3D Array vs Sparse Matrix for Sparse Data


I am building a simple model of the Milky Way and one of the things I need to store is a 3D grid of mass densities.

The problem is that if I put a rectangular box around the galaxy, most of the grid cells are empty. Which leaves me saving a lot of useless zeros. So the naive array seems wasteful:

galaxy = [[[0 for k in xrange(1601)] for j in xrange(1601)] for i in xrange(253)]
# then fill in i,j,k values that are non-zero

I tried to build a sparse array using a dictionary:

for x in range(1601):
    for y in range(1601):
        for z in range (253):
            galaxy[str(x) + "," + str(y) + "," + str(z)] = # whatever

But, (aside from being ugly) the strings that I was using for keys were taking up more memory than I was saving. I got OutOfMemoryErrors because (I calculated) the keys alone were taking a couple of gigs of memory.

At some point, I will want to increase the resolution of my model, and that will mean a much larger grid. Is there a more efficient way to store my values than using a 3D-Array of floats?

I am also concerned about the time it takes to iterate through all the cells (or just the non-zero cells in my grid. This will be very important.


Solution

  • Quick math: 1601 * 1601 * 253 => 648489853 items. A test indicates that the dictionary takes about 24 bytes per entry on a 32-bit machine, 49 bytes on a 64-bit machine, so that's 15,563,756,472 bytes (or 30GB on 64-bit). 10% of that is 1.5GB (or 3.0GB on 64-bit). If you have a 64-bit system with a bunch of memory, I think you'll be okay with a sparse representation.

    I recommend:

    1. Use a tuple as the key, not a string, and
    2. Use a sparse storage system where you don't store zero values.

    Here is one possibility:

    class SparseDict(dict):
      def __init__(self, default_value):
        dict.__init__(self)
        self._value = default_value
      def __getitem__(self, key):
        try:
          return dict.__getitem__(self, key)
        except KeyError:
          return self._value
      def __setitem__(self, key, val):
        # I'm sure this can go faster if I were smarter
        if val == self._value:
          if  key in self:
            del self[key]
        else:
          dict.__setitem__(self, key, val)
    
    def test(galaxy):
      import sys
      print len(galaxy), sys.getsizeof(galaxy)
    
      # test is 1/10th size in each dimension,
      # so 1/1000th of the volume
      for x in range(160):
        for y in range(160):
          for z in range (25):
            import random
            # 90% of space is essentially a vacuum
            if random.random() < .1:
              galaxy[x,y,z] = 1502100
            else:
              galaxy[x,y,z] = 0
    
      print len(galaxy), sys.getsizeof(galaxy)
    
    test(SparseDict(0))