Tags: python, performance, optimization, dataset, nested-loops

Optimizing Nested Loop Performance for Large Dataset Processing in Python


I am currently working on a data analysis script in Python 3.8 that needs to process a large dataset of over a million rows. The script uses nested loops to compare each row with multiple other rows based on certain criteria. I've noticed that performance is significantly slower than expected, and I suspect the nested loops are the bottleneck.

Here's a simplified version of the problematic part of the code:

import csv

file_path = 'data.csv'


data = []
with open(file_path, 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        data.append(row)
  
matching_pairs = []  # List to store the indices of matching row pairs

for i in range(len(data)):
    for j in range(i + 1, len(data)):
        if data[i][0] == data[j][0]:
            # Record the index of the earlier row in the matching pair
            matching_pairs.append(i)


output_file = 'matching_pairs.txt'
with open(output_file, 'w') as file:
    for pair in matching_pairs:
        file.write(f'{pair}\n')

The inner loop compares the current row with all subsequent rows, which is essential for my analysis. However, I expected the processing to be quicker. I'm looking for a way to optimize this part of the code to reduce the execution time.

What strategies can I employ to improve the performance of such intensive operations in Python? Are there built-in libraries or techniques in Python that could help optimize this nested loop?


Solution

  • To get the rows where some column value is duplicated, you can use groupby and exclude the groups of length 1.

    import pandas as pd
    df = pd.DataFrame({'val': [1, 2, 1, 2, 3, 3, 4],
                       'data': ['A', 'B', 'C', 'D', 'E', 'F', 'G']})
    groups = df.groupby('val', sort=False)
    results = []
    for _, group in groups:
        if len(group) != 1:
            results.extend(group.index[:-1])
    print(results)
    
    [0, 1, 4]
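If you stay in pandas, the per-group Python loop can usually be replaced with the vectorized duplicated method: keep='last' marks every row except the last occurrence of each value, which is exactly "all but the last index of each group". A minimal sketch on the same toy data:

```python
import pandas as pd

# duplicated('val', keep='last') flags each row whose 'val' appears again
# later, leaving the final occurrence of each value unflagged.
df = pd.DataFrame({'val': [1, 2, 1, 2, 3, 3, 4],
                   'data': ['A', 'B', 'C', 'D', 'E', 'F', 'G']})
results = df.index[df.duplicated('val', keep='last')].tolist()
print(results)  # [0, 1, 4]
```

Because this stays entirely in vectorized pandas code, it avoids the per-group Python overhead of iterating over the groupby object.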
    

    A faster version in pure Python is as follows:

    from collections import defaultdict
    data = [1, 2, 1, 2, 3, 3, 4]
    matching_pairs = []
    groups = defaultdict(list)
    # Group the indices of each distinct value in a single pass
    for i, value in enumerate(data):
        groups[value].append(i)
    # Keep all but the last index of every group that has duplicates
    for group in groups.values():
        if len(group) != 1:
            matching_pairs.extend(group[:-1])
    print(matching_pairs)
    
    [0, 1, 4]
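The same idea can be packaged as a reusable function; the commented-out wiring below is a hypothetical sketch of how it would plug into the original script, assuming data.csv and the first-column comparison from the question:

```python
import csv  # only needed for the commented-out CSV wiring below
from collections import defaultdict

def first_indices_of_duplicates(values):
    """Group indices by value; return all but the last index of each group."""
    groups = defaultdict(list)
    for i, value in enumerate(values):
        groups[value].append(i)
    out = []
    for indices in groups.values():
        if len(indices) > 1:
            out.extend(indices[:-1])
    return out

# Hypothetical wiring into the original script (grouping on column 0):
# with open('data.csv', newline='') as f:
#     col0 = [row[0] for row in csv.reader(f)]
# matching_pairs = first_indices_of_duplicates(col0)

print(first_indices_of_duplicates([1, 2, 1, 2, 3, 3, 4]))  # [0, 1, 4]
```

This replaces the O(n²) pairwise comparison with a single O(n) pass over the data, which is what makes it feasible for a million-row input.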
    

    Timings for a list of 1 million entries, each value repeated 3 times:

    time pandas groupby = 9.831918954849243 seconds
    time python groupby = 0.6703155040740967 seconds
    

    The pure-Python version wins because the pandas version spends most of its time converting Python objects into pandas structures and back again; pandas would likely come out ahead if it did all of the heavy lifting itself, from file reading to grouping to writing.
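An end-to-end pandas pipeline along those lines might look like the following sketch. The tiny data.csv written at the top is only demo setup, and the header-less, first-column layout is an assumption carried over from the question:

```python
import pandas as pd

# Demo setup: write a small stand-in for data.csv (assumed: no header row,
# first column holds the value being compared, as in the question).
with open('data.csv', 'w') as f:
    f.write('1,A\n2,B\n1,C\n2,D\n3,E\n3,F\n4,G\n')

# Keep everything inside pandas: read, mark duplicates vectorized, write out.
df = pd.read_csv('data.csv', header=None)
matching = df.index[df.duplicated(0, keep='last')]
with open('matching_pairs.txt', 'w') as f:
    f.write('\n'.join(map(str, matching)) + '\n')

print(list(matching))  # [0, 1, 4]
```

Here the Python-object round-trip disappears: pandas parses the file, the duplicate test runs as a vectorized operation, and only the final index list is materialized for writing.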