Tags: python, numpy, opencv, google-colaboratory

RAM usage when dealing with NumPy arrays and Python lists


I have memory issues and can't understand why. I'm using Google Colab, which gives me 12 GB of RAM and lets me monitor how the RAM is being used.

I'm reading np.array objects from files and loading each array into a list.

import glob
import sys
import cv2
import numpy as np

database_list = list()
for filename in glob.glob('*.npy'):
  temp_img = np.load(filename)                # load one array per file
  temp_img = temp_img.reshape((-1, 64)).astype('float32')
  temp_img = cv2.resize(temp_img, (64, 3072), interpolation=cv2.INTER_LINEAR)
  database_list.append(temp_img)

The code print("INTER_LINEAR: %d bytes" % (sys.getsizeof(database_list))) prints:

INTER_LINEAR: 124920 bytes

The value is the same whether the arrays are reshaped to 64x64, 512x64, 1024x64, 2048x64, or 3072x64. But if I reshape them to 4096x64, I get an error because too much RAM is used.
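
For context, a minimal comparison with dummy arrays (standing in for my files) reproduces the behaviour, no matter how big the individual arrays are:

import sys
import numpy as np

small = [np.zeros((64, 64), dtype='float32') for _ in range(10)]
large = [np.zeros((3072, 64), dtype='float32') for _ in range(10)]

# Both lists report the same size, regardless of the shape of the arrays they hold.
print(sys.getsizeof(small))
print(sys.getsizeof(large))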

With 3072x64 arrays I can see the RAM usage climbing higher and higher and then going back down.

My final goal is to zero-pad each array to a dimension of 8192x64, but my session crashes before that point; that is another problem, though.
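
For reference, the padding I have in mind would look roughly like this (a sketch assuming each MxN array has M <= 8192; pad_to_rows is just an illustrative helper name):

import numpy as np

def pad_to_rows(arr, target_rows=8192):
    # Append rows of zeros at the bottom until the array has target_rows rows.
    missing = target_rows - arr.shape[0]
    return np.pad(arr, ((0, missing), (0, 0)), mode='constant')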

How is the RAM being used? Why does the list report the same size even though the arrays have different dimensions? How is Python loading and manipulating these files in a way that explains the RAM usage pattern I see?

EDIT:

Does the following

sizeofelem = database_list[0].nbytes
# All arrays now have the same dimensions MxN, so regardless of their content
# they should occupy the same amount of memory.
total_size = sizeofelem * len(database_list)

work, and does total_size reflect the correct size of the list?


Solution

  • I've got the solution.

    First of all, as Dan Mašek pointed out, I was measuring the memory used by the list object itself, which is (roughly speaking) just a collection of pointers to the arrays. To measure the real memory usage of the data:

    print(database_list[0].nbytes * len(database_list) / 1000000, "MB")
    

    where database_list[0].nbytes is reliable because all the elements in database_list have the same size. To be more precise, I should also add the array metadata and possibly any data linked to it (if, for example, the arrays store references to other structures).
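
    As an illustration of that caveat, a rough accounting (a sketch, assuming each array owns its own data buffer and ignoring the small per-array header metadata) could look like this:

    import sys

    data_bytes = sum(a.nbytes for a in database_list)   # raw array buffers
    list_bytes = sys.getsizeof(database_list)           # the list's pointer storage
    print((data_bytes + list_bytes) / 1000000, "MB")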

    To reduce the memory footprint, I should take into account the type of data I'm reading; in this case the values are in the range 0-65535, so np.uint16 is enough:

    import glob
    import numpy as np

    database_list = list()
    for filename in glob.glob('*.npy'):
      temp_img = np.load(filename)
      temp_img = temp_img.reshape((-1, 64)).astype(np.uint16)  # 2 bytes per value instead of 4
      database_list.append(temp_img)
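
    To make the saving concrete, here is a quick check with a dummy 3072x64 array (an illustrative comparison, not part of the original code):

    import numpy as np

    a = np.zeros((3072, 64))
    print(a.astype('float32').nbytes)   # 786432 bytes (4 bytes per value)
    print(a.astype(np.uint16).nbytes)   # 393216 bytes (2 bytes per value)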
    

    Moreover, if I do some calculations on the data stored in database_list, for example normalizing the values to the range 0-1 with something like database_list = database_list / 65535.0 (NB: database_list, being a plain Python list, does not support that operation directly), I have to do another cast, because the division promotes the type to float64.
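
    A minimal sketch of that last point, assuming the arrays have first been stacked into a single NumPy array (database_array is an illustrative name):

    database_array = np.stack(database_list)    # works because all arrays have the same shape

    normalized = database_array / 65535.0       # uint16 / Python float is promoted to float64
    print(normalized.dtype)                     # float64

    normalized = (database_array / 65535.0).astype(np.float32)  # cast down to halve the memory
    print(normalized.dtype)                     # float32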