import pandas as pd
file1 = 'Total.xlsx'
df1 = pd.read_excel(file1)
file2 = 'Recent.xlsx'
df2 = pd.read_excel(file2)
non_matching_rows = []
for index1, row1 in df1.iterrows():
    row_matches = False
    for index2, row2 in df2.iterrows():
        if row1.equals(row2):
            row_matches = True
            break
    if not row_matches:
        non_matching_rows.append(row1)
non_matching_df = pd.DataFrame(non_matching_rows)
display(non_matching_df)
print(non_matching_df.count())
Total.xlsx contains almost 40k records and Recent.xlsx has almost 36k. I need to find the roughly 4k records that are unique to Total.xlsx. I tried the code above, but it never finishes on the full files. When I reduce both files it runs and produces accurate results, but only up to about 550 records; beyond that it doesn't work (I tried processing in chunks too, with no luck). Any suggestions?
Your code takes too long on large files because of its O(n^2) time complexity: comparing every row of a 40k-row dataframe against every row of a 36k-row dataframe via iterrows() means over a billion row comparisons. Use merge() instead of iterating over the rows of the dataframes. Here's an example:
# Outer merge on all shared columns; indicator=True tags each row's origin
merged_df = pd.merge(df1, df2, how='outer', indicator=True)
# Keep rows present only in df1 (Total.xlsx), then drop the helper column
non_matching_df = merged_df[merged_df['_merge'] == 'left_only'].drop('_merge', axis=1)
display(non_matching_df)
print(non_matching_df.count())
The indicator=True parameter adds a column _merge to the merged dataframe that indicates whether each row is present in both dataframes (both), only in the left dataframe (left_only), or only in the right dataframe (right_only).
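As a quick sketch of the approach on toy data (the column names here are made up, not from your files):

```python
import pandas as pd

# Toy stand-ins for Total.xlsx and Recent.xlsx (hypothetical columns)
df1 = pd.DataFrame({'id': [1, 2, 3, 4], 'name': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'id': [1, 2], 'name': ['a', 'b']})

# Outer merge on all shared columns; _merge records where each row came from
merged_df = pd.merge(df1, df2, how='outer', indicator=True)

# Rows tagged 'left_only' exist only in df1
non_matching_df = merged_df[merged_df['_merge'] == 'left_only'].drop('_merge', axis=1)
print(non_matching_df)  # rows (3, 'c') and (4, 'd')
```

Because this is a vectorized set operation rather than a Python-level double loop, it handles tens of thousands of rows in well under a second.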