import pandas as pd
file1 = 'Total.xlsx'
df1 = pd.read_excel(file1)
file2 = 'Recent.xlsx'
df2 = pd.read_excel(file2)
non_matching_rows = []
for index1, row1 in df1.iterrows():
    row_matches = False
    for index2, row2 in df2.iterrows():
        if row1.equals(row2):
            row_matches = True
            break
    if not row_matches:
        non_matching_rows.append(row1)
non_matching_df = pd.DataFrame(non_matching_rows)
display(non_matching_df)
print(non_matching_df.count())
Total.xlsx contains almost 40k records and Recent.xlsx has almost 36k. I need to find the roughly 4k records that are unique to Total.xlsx. I tried the code above, but it never finishes on the full files. When I reduce both files it runs and produces accurate results, but only up to about 550 records; beyond that it doesn't work (I tried processing in chunks too, with no luck). Any suggestions?
Your code takes too long on large files because of its O(n^2) time complexity: comparing every row of a 40k-row dataframe against every row of a 36k-row dataframe via iterrows() means over a billion row comparisons. Use merge() instead of iterating over the rows of the dataframes. Here's an example:
# Outer merge on all shared columns; indicator=True tags each row's origin
merged_df = pd.merge(df1, df2, how='outer', indicator=True)
# Keep rows present only in df1 (Total.xlsx), then drop the helper column
non_matching_df = merged_df[merged_df['_merge'] == 'left_only'].drop('_merge', axis=1)
display(non_matching_df)
print(non_matching_df.count())
The indicator=True parameter adds a column _merge to the merged dataframe that indicates whether each row is present in both dataframes (both), only in the left dataframe (left_only), or only in the right dataframe (right_only).
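As a quick sketch of the approach on toy data (the column names here are made up, not from your files):

```python
import pandas as pd

# Toy stand-ins for Total.xlsx and Recent.xlsx (hypothetical columns)
df1 = pd.DataFrame({'id': [1, 2, 3, 4], 'name': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'id': [1, 2], 'name': ['a', 'b']})

# Outer merge on all shared columns; _merge records where each row came from
merged_df = pd.merge(df1, df2, how='outer', indicator=True)

# Rows tagged 'left_only' exist only in df1
non_matching_df = merged_df[merged_df['_merge'] == 'left_only'].drop('_merge', axis=1)
print(non_matching_df)  # rows (3, 'c') and (4, 'd')
```

Because this is a vectorized set operation rather than a Python-level double loop, it handles tens of thousands of rows in well under a second.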