Comparing 2 Pandas dataframes row by row and performing a calculation on each row

I have created 2 Pandas dataframes and I need to compare rows in DF1 to rows in DF2, by way of a 2-column distance function.

Basically I need every row in DF2 to be compared to row 1 in DF 1; every row in DF2 to be compared to row 2 in DF1, etc provided they have the same Participant number and same Trial number. If the resulting distance value is less than a specific threshold, I need to document what row that was from DF2 in a new column.

I'm very new to Pandas, so I'm not quite sure where to start with this, but I have detailed the step-by-step logic of my problem below.

Here are my 2 dataframes (samples - the actual dataframes will be extremely large; 10,000+ rows each):

DF1:

IP_INDEX	IP_LABEL	PARTICIPANT	TRIAL	CURRENT_FIX_INDEX	PRE_FIX_X	PRE_FIX_Y
1	1st 3 Seconds	a	1	1	550	150
1	1st 3 Seconds	a	1	2	600	300
1	1st 3 Seconds	a	2	1	250	600
1	1st 3 Seconds	b	1	1	400	400
1	1st 3 Seconds	b	2	1	600	400

DF2:

IP_INDEX	IP_LABEL	PARTICIPANT	TRIAL	CURRENT_FIX_INDEX	POST_FIX_X	POST_FIX_Y
2	2nd 3 Seconds	a	1	1	500	200
2	2nd 3 Seconds	a	1	2	650	350
2	2nd 3 Seconds	a	2	1	300	650
2	2nd 3 Seconds	b	1	1	250	700
2	2nd 3 Seconds	b	1	2	450	150
2	2nd 3 Seconds	b	2	1	550	350
2	2nd 3 Seconds	b	2	2	350	550

Here is what I need (example for any given row):

Create a new column in DF1 called 'REFIX'. This will be blank for now, but cells will get filled in later based on the formula below.
If DF2 'PARTICIPANT' == DF1 'PARTICIPANT and DF2 'TRIAL' == DF1 'TRIAL',
1. Then calculate math.dist(p, q) where p and q are defined by the ***_FIX_X and ***_FIX_Y values from DF1 and DF2 (the *** stands for PRE or POST based on whether it's DF1 or DF2).
2. Else (i.e. If DF2 'PARTICIPANT' != DF1 'PARTICIPANT' or DF2 'TRIAL' != DF1 'TRIAL'), then move on to the next row in DF2
If result of math.dist(p, q) <= 150, then fill in 'REFIX' column in DF1 (for the row we're on in DF1) with the value of CURRENT_FIX_INDEX in DF2 (for the row that we're on in DF2)
Save DF1 as a new csv file

I'm not sure if I'm correct here, but this is what I have for determining the p and q values for the math.dist(p, q) function:

# Define p and q values for distance calculation
p = [(df1['PRE_FIX_X']), (df1['PRE_FIX_Y'])]
q = [(df2['POST_FIX_X']), (df2['POST_FIX_Y'])]

Based on the example dataframes given, a complete DF1 file would look like this:

IP_INDEX	IP_LABEL	PARTICIPANT	TRIAL	CURRENT_FIX_INDEX	PRE_FIX_X	PRE_FIX_Y	REFIX
1	1st 3 Seconds	a	1	1	550	150	1
1	1st 3 Seconds	a	1	2	600	300	1,2
1	1st 3 Seconds	a	2	1	250	600	1
1	1st 3 Seconds	b	1	1	400	400
1	1st 3 Seconds	b	2	1	600	400	1

Solution

EDITED: Now includes count of REFIXes per trial as requested in comments

Your problem appears quite complex and was a smidge hard to follow but I think this may be what you want?

I don't think iterating through the two df's with nested if statements is the most efficient route but I can't think of a better option right now :)

Part 1: Loading in the data you provided: (You can obviously ignore this bit, but putting it in for easy reproducibility/testing:

import pandas as pd

# Define the data for DF1
data_df1 = {
    'IP_INDEX': [1, 1, 1, 1, 1],
    'IP_LABEL': ['1st 3 Seconds', '1st 3 Seconds', '1st 3 Seconds', '1st 3 Seconds', '1st 3 Seconds'],
    'PARTICIPANT': ['a', 'a', 'a', 'b', 'b'],
    'TRIAL': [1, 1, 2, 1, 2],
    'CURRENT_FIX_INDEX': [1, 2, 1, 1, 1],
    'PRE_FIX_X': [550, 600, 250, 400, 600],
    'PRE_FIX_Y': [150, 300, 600, 400, 400]
}

# Create DF1
df1 = pd.DataFrame(data_df1)

data_df2 = {
    'IP_INDEX': [2, 2, 2, 2, 2, 2, 2],
    'IP_LABEL': ['2nd 3 Seconds', '2nd 3 Seconds', '2nd 3 Seconds', '2nd 3 Seconds', '2nd 3 Seconds', '2nd 3 Seconds', '2nd 3 Seconds'],
    'PARTICIPANT': ['a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'TRIAL': [1, 1, 2, 1, 1, 2, 2],
    'CURRENT_FIX_INDEX': [1, 2, 1, 1, 2, 1, 2],
    'POST_FIX_X': [500, 650, 300, 250, 450, 550, 350],
    'POST_FIX_Y': [200, 350, 650, 700, 150, 350, 550]
}

# Create DF2
df2 = pd.DataFrame(data_df2)

Part 2: Doing the calculations:

import math

#Create function to calculate distance between coordinates
def calculate_distance(df1_row,df2_row):
    p = (df1_row['PRE_FIX_X'], df1_row['PRE_FIX_Y'])
    q = (df2_row['POST_FIX_X'], df2_row['POST_FIX_Y'])
    return math.dist(p, q)

refix_values = [] #Create empty list to hold lists of matching FIX_INDEX values in df2

for index, df1_row in df1.iterrows(): #Iterate through each row in df1
    fix_index_list = [] #Create empty list to hold matching INDEX values for loop row
    df2_subset = df2.loc[(df2['PARTICIPANT'] == df1_row['PARTICIPANT']) & (df2['TRIAL'] == df1_row['TRIAL'])] #Subset out matching data in df2 so we're not looping through the entirety of df2
    for i, df2_row in df2_subset.iterrows(): #Iterate through df2 subset
        dist = calculate_distance(df1_row,df2_row) #Calculate dist for each row
        if dist <= 150:
            fix_index_list.append(df2_row['CURRENT_FIX_INDEX']) #Add dist to list if greater or equal to 150
    refix_values.append(','.join(map(str, fix_index_list))) #Add list of matching INDEX values to list of lists

df1['REFIX'] = refix_values #Convert list of lists to new column

def count_refix_items(row_refix_value):
    integers = [int(s) for s in row_refix_value.split(',') if s.isdigit()] #Split string by commas and count resultant elements
    return len(integers)

refix_per_trial = df1[df1['REFIX'].notna()].groupby(['TRIAL','PARTICIPANT'])['REFIX'].apply(lambda x: x.apply(count_refix_items).sum()).reset_index()
refix_per_trial = refix_per_trial.rename(columns={'REFIX': 'REFIXes/trial/participant'}) #Rename column

df1 = pd.merge(df1, refix_per_trial, on = ['TRIAL','PARTICIPANT'], how = 'outer')

#Write df to csv
df1.to_csv('REFIX_calc_output.tab', sep='\t', index=False) #Change name of file to whatever you want

The output I got using the input you provided:

IP_INDEX	IP_LABEL	PARTICIPANT	TRIAL	CURRENT_FIX_INDEX	PRE_FIX_X	PRE_FIX_Y	REFIX	REFIXes/trial/participant
1	1st 3 Seconds	a	1	1	550	150	1	3
1	1st 3 Seconds	a	1	2	600	300	1,2	3
1	1st 3 Seconds	a	2	1	250	600	1	1
1	1st 3 Seconds	b	1	1	400	400		0
1	1st 3 Seconds	b	2	1	600	400	1	1