python, jupyter-notebook, np-complete

Why is my memory footprint blowing up in this greedy approach to TSP?


I have an assignment to use a greedy approach to solve TSP. The problem has 33708 cities. Because I had a lot of helpful tools for this from the previous assignment, I decided to reuse that approach and precompute all the pairwise distances.

So that is barely more than half a billion entries (33708 choose 2 ≈ 568 million), each comfortably fitting in a float32. The x and y coordinates, likewise, are numbers $|n| < 10000$ with no more than 4 decimal places.
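
For reference, the pair count and the distance range can be sanity-checked quickly (pure arithmetic, nothing from the assignment data itself):

import math
import numpy as np

n_cities = 33708
print(f"{math.comb(n_cities, 2):,}")          # 568,097,778 unordered pairs

# coordinates satisfy |n| < 10000, so the largest possible distance is the
# diagonal of a 20000 x 20000 square -- comfortably inside float32 range
print(np.float32(math.hypot(20000, 20000)))   # ~28284.27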

My python for the same was:

def get_distance(left, right):
  """ return the euclidean distance between tuples left and right, which are coordinates"""
  return ((left[0] - right[0]) ** 2 + (left[1] - right[1]) ** 2) ** 0.5

# precompute all distances
distances = {}
for i in range(len(cities)):
  for j in range(i + 1, len(cities)):
    d = get_distance(cities[i], cities[j])
    distances[frozenset((i, j))] = d

and I expected this to occupy (3 × 32 bits) × 568M ≈ 6.7 gigabytes of memory. But in fact, watching the live memory usage of my Jupyter notebook, it shot past 35 GB (442 s and counting). I had to kill it once I was well into swap and everything slowed to a crawl. Anyone know why this is so surprisingly large?
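
That figure counts raw payload only, i.e. three 4-byte values per pair with zero per-object overhead:

import math

n_pairs = math.comb(33708, 2)      # ≈ 568 million pairs

# two int32 indices + one float32 distance = 12 bytes per pair,
# with no Python object or dict overhead counted
print(n_pairs * 12 / 1e9, "GB")    # ≈ 6.8 GB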

Update: trying again with tuple(sorted((i, j))) as the key instead -- but already at 110 s it is at 15 GB and counting.

sizes

>>> import sys
>>> a = frozenset((1,2))
>>> sys.getsizeof(a)
216
>>> sys.getsizeof(tuple(sorted((1,2))))
56
>>> sys.getsizeof(1)
28

Is there anything like float32 and int16 in Python? -- answer: numpy has them.

updated attempt:

from numpy import float32, int16
from itertools import combinations
import sys

def get_distance(left, right):
  """ return the euclidean distance between tuples left and right, which are coordinates"""
  return float32(((left[0] - right[0]) ** 2 + (left[1] - right[1]) ** 2) ** 0.5)

# precompute all distances
distances = {}
for i, j in combinations(range(len(cities)), 2):
    distances[tuple(sorted((int16(i), int16(j))))] = get_distance(cities[i], cities[j])

print(sys.getsizeof(distances))

observed sizes:

  • with cities = cities[:2] : 232
  • with cities = cities[:3] : also 232
  • with cities = cities[:10] : 2272
  • with cities = cities[:100] : 147552
  • with cities = cities[:1000] : 20971608 (20MB)
  • with cities = cities[:10000] : 2684354656 (2.6GB)

note that the reported size grows faster than the number of entries, even as we approach 50 million entries, i.e. 10000 choose 2 (about 10% of the full half-billion); comparing each measurement to a linear extrapolation from the one before it (see the snippet after this list):

  • 2684354656 / (10000 choose 2 / 1000 choose 2 * 20971608) ≈ 1.27
  • 20971608 / (1000 choose 2 / 100 choose 2 * 147552) ≈ 1.4
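
Part of that jumpy growth is just how CPython dicts resize: the underlying table grows in large steps, and sys.getsizeof(d) reports only that table, not the key and value objects it references. A small illustration (exact byte counts depend on the Python version):

import sys

d = {}
last = sys.getsizeof(d)
for k in range(100_000):
    d[k] = 0.0
    size = sys.getsizeof(d)
    if size != last:
        # the dict's table is reallocated in big jumps, not per entry
        print(f"{len(d):>7} entries -> {size:>10} bytes")
        last = size

# none of this includes the keys and values themselves --
# sys.getsizeof(d) only measures the dict's internal table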

I decided to halt my attempt at the full cities list, as the OS reported the process using well over 30 GB and I was about to hit swap. This means that, even if the final object ends up that big, the amount of memory the notebook requires is much larger still.


Solution

  • Python objects have overhead because of dynamic typing and reference counting. The absolute minimal object, object(), has a size of 16 bytes on a 64-bit machine: an 8-byte reference count and an 8-byte type pointer. No Python object can be smaller than that. float and int are slightly larger, at least 24 bytes each. A list is at minimum an array of pointers, which adds another 8 bytes per element. So the smallest possible memory footprint of a list of half a billion ints is 32 * 500_000_000 ≈ 16 GB. Sets and dicts are even larger than that, since they store more than just one pointer per element.

    Use numpy (maybe the stdlib array module is already enough); see the sketch below.

    (Note: individual numpy scalars such as float32(x) can't be smaller than 16 bytes either, since each one is still a full Python object; the savings only materialize inside a numpy array.)
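
A minimal sketch of what the numpy route could look like here, assuming cities is a sequence of (x, y) pairs; the condensed one-dimensional layout and the pair_index helper are just one convenient way to arrange it:

import numpy as np

def precompute_distances(cities):
    """Return all pairwise distances as one condensed float32 array.

    The distance for the pair (i, j) with i < j is stored at index
    i*n - i*(i + 1)//2 + (j - i - 1), the usual condensed layout.
    """
    coords = np.asarray(cities, dtype=np.float32)         # shape (n, 2)
    n = len(coords)
    out = np.empty(n * (n - 1) // 2, dtype=np.float32)    # ~2.3 GB for n = 33708
    start = 0
    for i in range(n - 1):
        diff = coords[i + 1:] - coords[i]                 # one vectorised row per city
        out[start:start + len(diff)] = np.sqrt((diff * diff).sum(axis=1))
        start += len(diff)
    return out

def pair_index(i, j, n):
    """Map an unordered pair of city indices to its position in the condensed array."""
    if i > j:
        i, j = j, i
    return i * n - i * (i + 1) // 2 + (j - i - 1)

# usage (hypothetical):
# distances = precompute_distances(cities)
# d = distances[pair_index(4, 17, len(cities))]

If scipy is available, scipy.spatial.distance.pdist(coords) builds exactly this condensed array in one call; it returns float64, so an .astype(np.float32) halves the footprint to roughly 2.3 GB.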