python list optimization python-itertools cartesian-product

Memory leak issue with a Python list

The identities list contains a large array of approximately 57000 images. I am creating a list of negative pairs with itertools.product(). This stores the whole list in memory, which is very costly, and my system hung after 4 minutes.

How can I optimize the code below and avoid holding everything in memory?

for i in range(0, len(idendities) - 1):
    for j in range(i + 1, len(idendities)):
        cross_product = itertools.product(samples_list[i], samples_list[j])
        cross_product = list(cross_product)

        for cross_sample in cross_product:
            negative = []
            negative.append(cross_sample[0])
            negative.append(cross_sample[1])
            negatives.append(negative)
            print(len(negatives))

negatives = pd.DataFrame(negatives, columns=["file_x", "file_y"])
negatives["decision"] = "No"

negatives = negatives.sample(positives.shape[0])

Memory usage (9.30 at this point) keeps climbing, and eventually the system hangs completely.

I also tried the answer below and modified my code accordingly:

for i in range(0, len(idendities) - 1):
    for j in range(i + 1, len(idendities)):
        for cross_sample in itertools.product(samples_list[i], samples_list[j]):
            negative = [cross_sample[0], cross_sample[1]]
            negatives.append(negative)
            print(len(negatives))

negatives = pd.DataFrame(negatives, columns=["file_x", "file_y"])
negatives["decision"] = "No"

Third version of code

The resulting CSV file is so large that opening it triggers an alert saying the program cannot load the whole file. As for the process, it runs for about ten minutes and then the system hangs completely again.

for i in range(0, len(idendities) - 1):
    for j in range(i + 1, len(idendities)):
        for cross_sample in itertools.product(samples_list[i], samples_list[j]):
            with open('/home/khawar/deepface/tests/results.csv', 'a+') as csvfile:
                writer = csv.writer(csvfile)
                writer.writerow([cross_sample[0], cross_sample[1]])
            negative = [cross_sample[0], cross_sample[1]]
            negatives.append(negative)

negatives = pd.DataFrame(negatives, columns=["file_x", "file_y"])
negatives["decision"] = "No"

negatives = negatives.sample(positives.shape[0])

Memory screenshot.

Solution

Every generated pair is kept in memory, and that is why memory usage keeps growing.
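To get a feel for how many pairs that is, the count can be computed from the group sizes alone, without generating a single pair. The sketch below is only an illustration: `count_negative_pairs` and the toy data are hypothetical names, and it assumes `samples_list` is a list of per-identity lists of file names, as in the question.

```python
def count_negative_pairs(samples_list):
    """Count cross-identity pairs without generating any of them."""
    sizes = [len(samples) for samples in samples_list]
    total = sum(sizes)
    # Sum over i < j of n_i * n_j equals (total^2 - sum of n_i^2) / 2.
    return (total * total - sum(n * n for n in sizes)) // 2


# Hypothetical toy data: three identities with 2, 3 and 1 images.
toy = [["a1", "a2"], ["b1", "b2", "b3"], ["c1"]]
n_pairs = count_negative_pairs(toy)  # 2*3 + 2*1 + 3*1 pairs
```

With roughly 57000 images the count can reach about 1.6 billion pairs in the worst case (every image belonging to a different identity), which is far more than a Python list of lists can hold on a typical machine.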

You have to change the code so that each pair is generated and then released from memory immediately, instead of being accumulated in a list.
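Since only `positives.shape[0]` negatives are kept in the end, one way to do this is reservoir sampling over a lazy stream of pairs. This is a minimal sketch of a different technique, not the code from the question: `iter_negative_pairs` and `reservoir_sample` are hypothetical helper names, and `samples_list` is assumed to be a list of per-identity lists.

```python
import itertools
import random


def iter_negative_pairs(samples_list):
    """Lazily yield every cross-identity pair (file_x, file_y)."""
    for group_x, group_y in itertools.combinations(samples_list, 2):
        yield from itertools.product(group_x, group_y)


def reservoir_sample(iterable, k, seed=None):
    """Return k items drawn uniformly from iterable, holding at most k at once."""
    rng = random.Random(seed)
    reservoir = []
    for index, item in enumerate(iterable):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = item
    return reservoir


# Hypothetical toy data: three identities with 2, 3 and 1 images.
toy = [["a1", "a2"], ["b1", "b2", "b3"], ["c1"]]
sampled = reservoir_sample(iter_negative_pairs(toy), k=4, seed=0)
```

The sampled list can then be passed to `pd.DataFrame(sampled, columns=["file_x", "file_y"])`. Memory stays bounded by `k`, but the loop still visits every pair, so the run time remains proportional to the total number of pairs.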

Previous Code:

for i in range(0, len(idendities) - 1):
    for j in range(i + 1, len(idendities)):
        cross_product = itertools.product(samples_list[i], samples_list[j])
        cross_product = list(cross_product)

        for cross_sample in cross_product:
            negative = []
            negative.append(cross_sample[0])
            negative.append(cross_sample[1])
            negatives.append(negative)
            print(len(negatives))

negatives = pd.DataFrame(negatives, columns=["file_x", "file_y"])
negatives["decision"] = "No"

Memory-Efficient Code:

Save the pairs to a CSV file so that they do not have to be generated again on the next run.

import itertools
from pathlib import Path

import pandas as pd
from tqdm import tqdm

# `identities` (dict: identity -> list of image paths) and `positives`
# (DataFrame) are assumed to be defined earlier, as in the question.
if Path("positives_negatives.csv").exists():
    # Reuse the pairs saved by a previous run.
    df = pd.read_csv("positives_negatives.csv")
else:
    rows = []
    for combo in tqdm(itertools.combinations(identities.values(), 2), desc="Negatives"):
        # Consume the product lazily; no intermediate list(...) per identity pair.
        rows.extend(itertools.product(combo[0], combo[1]))
    # DataFrame.append was removed in pandas 2.0, so build the frame once.
    negatives = pd.DataFrame(rows, columns=["file_x", "file_y"])
    negatives["decision"] = "No"
    negatives = negatives.sample(positives.shape[0])
    df = pd.concat([positives, negatives]).reset_index(drop=True)
    df.to_csv("positives_negatives.csv", index=False)
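If the full set of pairs really has to be on disk, as in the third version of the question, the rows can be streamed to the CSV file through a single open handle, without keeping a `negatives` list at all. The following is a sketch under assumptions: `write_negative_pairs` is a hypothetical helper, `samples_list` is a list of per-identity lists, and the output path here is a temporary file chosen only for the example.

```python
import csv
import itertools
import os
import tempfile


def write_negative_pairs(samples_list, csv_path):
    """Stream every cross-identity pair to csv_path and return the row count."""
    count = 0
    # Open the file once, not once per row as in the third version.
    with open(csv_path, "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["file_x", "file_y"])  # header row
        for group_x, group_y in itertools.combinations(samples_list, 2):
            for pair in itertools.product(group_x, group_y):
                writer.writerow(pair)  # nothing is accumulated in memory
                count += 1
    return count


# Hypothetical toy data: three identities with 2, 3 and 1 images.
toy = [["a1", "a2"], ["b1", "b2", "b3"], ["c1"]]
out_path = os.path.join(tempfile.mkdtemp(), "negatives.csv")
rows_written = write_negative_pairs(toy, out_path)
```

A file produced this way can still be too large to load in one go; reading it back with `pd.read_csv(path, chunksize=...)` processes it in pieces instead.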