I have a 3-column text file of about 28 GB. I would like to read it with Python and put its contents into a list of 3-element tuples. Here's the code I'm using:
f = open(filename)
col1 = [float(l.split()[0]) for l in f]
f.seek(0)
col2 = [float(l.split()[1]) for l in f]
f.seek(0)
col3 = [float(l.split()[2]) for l in f]
f.close()
rowFormat = [col1,col2,col3]
tupleFormat = zip(*rowFormat)
for ele in tupleFormat:
    ### do something with ele
There's no 'break' statement in the for loop, so the whole file really is read. While the script is running, I notice from 'htop' that it uses 156 GB of virtual memory (VIRT column) and almost as much resident memory (RES column). Why does my script use 156 GB when the file is only 28 GB?
Python objects have a lot of overhead, e.g. the object's reference count and other bookkeeping. That means a Python float is more than 8 bytes. On my 32-bit Python version, it is
>>> import sys
>>> print(sys.getsizeof(float(0)))
16
A list has its own overhead, and on 32-bit it needs another 4 bytes per element to store a reference to each float object. So 100 floats in a list (built here with Python 2's map, which returns a list) actually take up:
>>> a = map(float, range(100))
>>> sys.getsizeof(a) + sys.getsizeof(a[0])*len(a)
2036
Now, a numpy array is different. It has a little bit of overhead, but the raw data under the hood is stored contiguously, as in C:
>>> import numpy as np
>>> b = np.array(a)
>>> sys.getsizeof(b)
848
>>> b.itemsize # number of bytes per element
8
So inside a list, a Python float effectively costs 20 bytes (16 for the object plus 4 for the reference) compared to 8 per element in numpy, and 64-bit Python versions need even more.
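To get a feeling for how that scales to your 28 GB file, here is a rough back-of-the-envelope sketch (my own estimate, not measured on your data) of what one row costs in the list-of-tuples layout on a 64-bit build; the row count is a placeholder you would replace with the real number of lines in your file:

import sys

row = (1.0, 2.0, 3.0)                                   # what one line of the file becomes
bytes_per_row = (sys.getsizeof(row)                     # the tuple object itself
                 + sum(sys.getsizeof(x) for x in row)   # its three float objects
                 + 8)                                   # one list slot pointing at the tuple (64-bit)
n_rows = 10**9                                          # placeholder: use your real line count
print('roughly %.0f GB just for the tuples' % (bytes_per_row * n_rows / 1e9))

On a typical 64-bit CPython this works out to well over 100 bytes per row before the three column lists are even counted, which is the right order of magnitude for the 156 GB you observed.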
So really, if you must load a lot of data into memory, numpy is one way to go. Looking at the way you load the data, I assume the file is in text format with 3 floats per row, separated by an arbitrary number of spaces. In that case, you could simply use numpy.genfromtxt():
data = np.genfromtxt(fname, autostrip=True)
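The result is a single 2-D float64 array with one row per line of the file, so (continuing from the call above) you can iterate over rows much like your list of tuples, or pull out a whole column as a slice:

for ele in data:       # each ele is a length-3 array, analogous to your 3-element tuples
    pass               # do something with ele
col1 = data[:, 0]      # an entire column as a view, without copying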
You could also look into other options, e.g. mmap, but I don't know enough about it to say whether it would be more appropriate for you.
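Purely as a sketch of that idea (an assumption on my part, not something tested on your data): if you will re-read this data many times, you could convert the text file once into a binary .npy file and then memory-map it on later runs, so numpy only pages in the rows you actually touch; 'data.npy' below is just a placeholder name:

import numpy as np

# one-time conversion; this step still needs enough RAM to hold the full array
data = np.genfromtxt(fname, autostrip=True)
np.save('data.npy', data)

# later runs: memory-map the saved array; rows are read lazily from disk on access
data = np.load('data.npy', mmap_mode='r')
print(data[0])        # touches only the first few pages of the file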