Tags: python, numpy, memory, memory-leaks, train-test-split

np.load() takes too much memory


I'm very new to Python and machine learning in general and have been working on a school-related project recently. I'm currently stuck on this code block, as it has been taking too much memory when I load my .npy files.

import numpy as np
from sklearn.model_selection import train_test_split

features_list = []
labels_list = []

for i in range(9221):
    npy_path = f'E:/padFrames/video_{i}.npy'
    frames_array = np.load(npy_path, mmap_mode='r')
    
    # Append frames to the features list
    features_list.append(frames_array)

    # Extract labels for the current video
    labels = df.iloc[i][categoryEmotions].values
    labels_list.append(labels)

# Convert features list to numpy array
features = np.asarray(features_list)

# Convert labels list to numpy array
labels = np.array(labels_list)



features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size = 0.25, shuffle = True, random_state = 42)

I have 16 GB of RAM, and there are 9221 .npy files in total, each with shape (298, 32, 32, 3) and a size of about 895 kB. Is something wrong or missing in my code? Is there any way to load these .npy files without using so much memory?

I've tried this potential solution:

del data
gc.collect()

but it doesn't work in my case. Hoping for a kind answer.


Solution

  • features_list holds about 9 GB of frame data (9221 files × ~895 kB each), and features is another ~9 GB copy of the same data. That's more than your 16 GB of RAM.

    It might be easier to read one .npy file directly into features, reshape it to add an extra dimension of size 1, resize that dimension to 9221, and then assign the remaining 9220 files to slices of features. That way you only ever assign about 895 kB at a time, and total usage stays a little over 9 GB instead of doubling (a sketch of this follows below).

    The trick with numpy and big data is to keep everything in numpy arrays, not in raw Python lists, and certainly not to convert back and forth between the two.
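
    Here is a minimal sketch of that idea, assuming uint8 frames and using np.empty to preallocate the full array in one step (equivalent in effect to the reshape-and-resize route described above). df and categoryEmotions are the same objects as in the question:

    import numpy as np

    n_videos = 9221

    # Load the first file to learn the per-video shape and dtype
    first = np.load('E:/padFrames/video_0.npy')            # (298, 32, 32, 3)

    # Preallocate the full (9221, 298, 32, 32, 3) array once: roughly 9 GB
    # for 1-byte (uint8) frames, with no second copy ever made.
    features = np.empty((n_videos,) + first.shape, dtype=first.dtype)
    features[0] = first
    del first

    # Fill the remaining slices one file (~895 kB) at a time
    for i in range(1, n_videos):
        features[i] = np.load(f'E:/padFrames/video_{i}.npy')

    # The labels are tiny by comparison, so a plain list comprehension is fine
    labels = np.array([df.iloc[i][categoryEmotions].values for i in range(n_videos)])

    Note that train_test_split returns copies of the rows it selects, so the split itself will still roughly double memory use at its peak; if that is too much, one option is to split an array of indices instead and index into features only when a batch is actually needed.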