I have an assignment to use a greedy approach to solve a TSP instance. The problem has 33708 cities. Because I had a lot of helpful tools for this from the previous assignment, I decided to reuse that approach and precompute the distances.
So that is barely more than half a billion entries (33708 choose 2), each comfortably fitting in a float32. The x and y coordinates, likewise, are numbers $|n| < 10000$ with no more than 4 decimal places.
My Python code for this was:
def get_distance(left, right):
    """Return the Euclidean distance between tuples left and right, which are coordinates."""
    return ((left[0] - right[0]) ** 2 + (left[1] - right[1]) ** 2) ** 0.5

# precompute all distances
distances = {}
for i in range(len(cities)):
    for j in range(i + 1, len(cities)):
        d = get_distance(cities[i], cities[j])
        distances[frozenset((i, j))] = d
and I expected this to occupy (3 * 32 bits) * 568M ≈ 6.8 gigabytes of memory. But in fact, watching the live memory usage in my Jupyter notebook, it appears to be shooting past even 35 GB (442 s and counting). I had to kill it as I was well into my swap space and everything had slowed down a lot. Anyone know why this is so surprisingly large?
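For reference, this is the back-of-the-envelope arithmetic behind that estimate (a quick sketch, assuming 4 bytes each for two indices and one float32 distance, with no per-object overhead):

from math import comb

n_cities = 33708
n_pairs = comb(n_cities, 2)             # 568_097_778 distinct pairs
bytes_per_entry = 3 * 4                 # two 32-bit indices + one float32 distance
print(n_pairs * bytes_per_entry / 1e9)  # ~6.8, i.e. roughly 6.8 GB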
Update: trying again with tuple(sorted((i, j)))
-- but already at 110 s it is at 15 GB and counting
>>> import sys
>>> a = frozenset((1,2))
>>> sys.getsizeof(a)
216
>>> sys.getsizeof(tuple(sorted((1,2))))
56
>>> sys.getsizeof(1)
28
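Worth noting (my own addition): sys.getsizeof is shallow, so even the 56 bytes for the tuple key understate the real cost of one dict entry, because the tuple's elements and the float value are separate objects on top of the dict's own slot.

import sys

key = tuple(sorted((1, 2)))
value = 3.5

# shallow size of the key alone
print(sys.getsizeof(key))                  # 56
# ...plus the int objects the tuple merely points to
print(sum(sys.getsizeof(x) for x in key))  # 28 + 28 = 56
# ...plus the float value object
print(sys.getsizeof(value))                # 24
# so each entry costs well over 100 bytes before the dict's own
# hash-table slot (key pointer, value pointer, cached hash) is counted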
Is there anything like float32 and int16 in Python? -- answer: numpy has them
from numpy import float32, int16
from itertools import combinations
import sys

def get_distance(left, right):
    """Return the Euclidean distance between tuples left and right, which are coordinates."""
    return float32(((left[0] - right[0]) ** 2 + (left[1] - right[1]) ** 2) ** 0.5)

# precompute all distances
distances = {}
for i, j in combinations(range(len(cities)), 2):
    # note: city indices run up to 33707, which overflows int16 (max 32767)
    distances[tuple(sorted((int16(i), int16(j))))] = get_distance(cities[i], cities[j])
print(sys.getsizeof(distances))
observed sizes:

cities = cities[:2]     : 232
cities = cities[:3]     : also 232
cities = cities[:10]    : 2272
cities = cities[:100]   : 147552
cities = cities[:1000]  : 20971608 (20 MB)
cities = cities[:10000] : 2684354656 (2.6 GB)

Note that the growth rate does not scale with the data, even as we approach 50 million entries, i.e. 10000 choose 2 (about 10% of the total size of the data).
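One caveat about those numbers (my own note): sys.getsizeof on a dict reports only the dict's internal table of slots, not the key tuples, the objects inside them, or the values they point to, so the real footprint is several times larger. Something like the hypothetical helper below gives a rougher but deeper estimate:

import sys

def deep_sizeof(d):
    """Rough deep size of the distances dict: the slot table itself,
    plus every key tuple, the objects inside it, and every value.
    (Illustrative helper only; it ignores sharing of cached small ints.)"""
    total = sys.getsizeof(d)
    for k, v in d.items():
        total += sys.getsizeof(k) + sum(sys.getsizeof(x) for x in k)
        total += sys.getsizeof(v)
    return total

print(deep_sizeof(distances))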
I decided to halt my attempt at the full cities list, as my OS's snapshot of the process memory grew to well over 30 GB and I was about to start swapping. This means that, even if the final object ends up that big, the amount of memory the notebook requires along the way is much larger still.
Python objects have an overhead because of dynamic typing and reference counting. The absolute minimal object, object(), has a size of 16 bytes (on 64-bit machines): an 8-byte reference count and an 8-byte type pointer. No Python object can be smaller than that. float and int are slightly larger, at least 24 bytes each. A list is at minimum an array of pointers, which adds an additional 8 bytes per element. So the smallest possible memory footprint of a list of half a billion ints is 32 * 500_000_000 ≈ 16 GB. sets and dicts are even larger than that, since they store more than just one pointer per element.
Use numpy (maybe the stdlib array module is already enough).

(Note: a standalone NumPy float32 scalar can't be smaller than 16 bytes either; the 4-bytes-per-value saving only happens when the values live inside a NumPy array.)
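To make that concrete, here is one way it could look (a sketch of my own, not the asker's assignment code; it assumes `cities` is the list of (x, y) tuples from the question): store all pairwise distances in a single condensed float32 array, in the same row-by-row upper-triangle order that scipy.spatial.distance.pdist uses, so each distance costs exactly 4 bytes, about 2.3 GB for 33708 cities.

import numpy as np

coords = np.asarray(cities, dtype=np.float64)   # assumed: `cities` is a list of (x, y) tuples
n = len(coords)

# One flat buffer of n*(n-1)//2 float32 distances (~2.3 GB for n = 33708),
# holding the upper triangle row by row: (0,1), (0,2), ..., (1,2), (1,3), ...
dist = np.empty(n * (n - 1) // 2, dtype=np.float32)

def pair_index(i, j, n=n):
    """Position of the pair (i, j), with i < j, in the condensed array."""
    return i * n - i * (i + 1) // 2 + (j - i - 1)

# Fill one row of the triangle at a time with vectorised NumPy arithmetic
# instead of a Python-level inner loop over pairs.
for i in range(n - 1):
    start = pair_index(i, i + 1)
    diff = coords[i + 1:] - coords[i]
    dist[start:start + (n - i - 1)] = np.sqrt((diff * diff).sum(axis=1))

# e.g. the distance between city 5 and city 42:
d = dist[pair_index(5, 42)]

The condensed layout stores each pair exactly once, like the frozenset/tuple keys did, but lookups become plain index arithmetic instead of hashing Python objects, and the per-distance cost stays at 4 bytes.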