I have created 2 Pandas dataframes and I need to compare rows in DF1 to rows in DF2, by way of a 2-column distance function.
Basically I need every row in DF2 to be compared to row 1 in DF 1; every row in DF2 to be compared to row 2 in DF1, etc provided they have the same Participant number and same Trial number. If the resulting distance value is less than a specific threshold, I need to document what row that was from DF2 in a new column.
I'm very new to Pandas, so I'm not quite sure where to start with this, but I have detailed the step-by-step logic of my problem below.
Here are my 2 dataframes (samples - the actual dataframes will be extremely large; 10,000+ rows each):
DF1:
IP_INDEX | IP_LABEL | PARTICIPANT | TRIAL | CURRENT_FIX_INDEX | PRE_FIX_X | PRE_FIX_Y |
---|---|---|---|---|---|---|
1 | 1st 3 Seconds | a | 1 | 1 | 550 | 150 |
1 | 1st 3 Seconds | a | 1 | 2 | 600 | 300 |
1 | 1st 3 Seconds | a | 2 | 1 | 250 | 600 |
1 | 1st 3 Seconds | b | 1 | 1 | 400 | 400 |
1 | 1st 3 Seconds | b | 2 | 1 | 600 | 400 |
DF2:
IP_INDEX | IP_LABEL | PARTICIPANT | TRIAL | CURRENT_FIX_INDEX | POST_FIX_X | POST_FIX_Y |
---|---|---|---|---|---|---|
2 | 2nd 3 Seconds | a | 1 | 1 | 500 | 200 |
2 | 2nd 3 Seconds | a | 1 | 2 | 650 | 350 |
2 | 2nd 3 Seconds | a | 2 | 1 | 300 | 650 |
2 | 2nd 3 Seconds | b | 1 | 1 | 250 | 700 |
2 | 2nd 3 Seconds | b | 1 | 2 | 450 | 150 |
2 | 2nd 3 Seconds | b | 2 | 1 | 550 | 350 |
2 | 2nd 3 Seconds | b | 2 | 2 | 350 | 550 |
Here is what I need (example for any given row):
Create a new column in DF1 called 'REFIX'. This will be blank for now, but cells will get filled in later based on the formula below.
If DF2 'PARTICIPANT' == DF1 'PARTICIPANT and DF2 'TRIAL' == DF1 'TRIAL',
If result of math.dist(p, q) <= 150, then fill in 'REFIX' column in DF1 (for the row we're on in DF1) with the value of CURRENT_FIX_INDEX in DF2 (for the row that we're on in DF2)
Save DF1 as a new csv file
I'm not sure if I'm correct here, but this is what I have for determining the p and q values for the math.dist(p, q) function:
# Define p and q values for distance calculation
p = [(df1['PRE_FIX_X']), (df1['PRE_FIX_Y'])]
q = [(df2['POST_FIX_X']), (df2['POST_FIX_Y'])]
Based on the example dataframes given, a complete DF1 file would look like this:
IP_INDEX | IP_LABEL | PARTICIPANT | TRIAL | CURRENT_FIX_INDEX | PRE_FIX_X | PRE_FIX_Y | REFIX |
---|---|---|---|---|---|---|---|
1 | 1st 3 Seconds | a | 1 | 1 | 550 | 150 | 1 |
1 | 1st 3 Seconds | a | 1 | 2 | 600 | 300 | 1,2 |
1 | 1st 3 Seconds | a | 2 | 1 | 250 | 600 | 1 |
1 | 1st 3 Seconds | b | 1 | 1 | 400 | 400 | |
1 | 1st 3 Seconds | b | 2 | 1 | 600 | 400 | 1 |
EDITED: Now includes count of REFIXes per trial as requested in comments
Your problem appears quite complex and was a smidge hard to follow but I think this may be what you want?
I don't think iterating through the two df's with nested if statements is the most efficient route but I can't think of a better option right now :)
Part 1: Loading in the data you provided: (You can obviously ignore this bit, but putting it in for easy reproducibility/testing:
import pandas as pd
# Define the data for DF1
data_df1 = {
'IP_INDEX': [1, 1, 1, 1, 1],
'IP_LABEL': ['1st 3 Seconds', '1st 3 Seconds', '1st 3 Seconds', '1st 3 Seconds', '1st 3 Seconds'],
'PARTICIPANT': ['a', 'a', 'a', 'b', 'b'],
'TRIAL': [1, 1, 2, 1, 2],
'CURRENT_FIX_INDEX': [1, 2, 1, 1, 1],
'PRE_FIX_X': [550, 600, 250, 400, 600],
'PRE_FIX_Y': [150, 300, 600, 400, 400]
}
# Create DF1
df1 = pd.DataFrame(data_df1)
data_df2 = {
'IP_INDEX': [2, 2, 2, 2, 2, 2, 2],
'IP_LABEL': ['2nd 3 Seconds', '2nd 3 Seconds', '2nd 3 Seconds', '2nd 3 Seconds', '2nd 3 Seconds', '2nd 3 Seconds', '2nd 3 Seconds'],
'PARTICIPANT': ['a', 'a', 'a', 'b', 'b', 'b', 'b'],
'TRIAL': [1, 1, 2, 1, 1, 2, 2],
'CURRENT_FIX_INDEX': [1, 2, 1, 1, 2, 1, 2],
'POST_FIX_X': [500, 650, 300, 250, 450, 550, 350],
'POST_FIX_Y': [200, 350, 650, 700, 150, 350, 550]
}
# Create DF2
df2 = pd.DataFrame(data_df2)
Part 2: Doing the calculations:
import math
#Create function to calculate distance between coordinates
def calculate_distance(df1_row,df2_row):
p = (df1_row['PRE_FIX_X'], df1_row['PRE_FIX_Y'])
q = (df2_row['POST_FIX_X'], df2_row['POST_FIX_Y'])
return math.dist(p, q)
refix_values = [] #Create empty list to hold lists of matching FIX_INDEX values in df2
for index, df1_row in df1.iterrows(): #Iterate through each row in df1
fix_index_list = [] #Create empty list to hold matching INDEX values for loop row
df2_subset = df2.loc[(df2['PARTICIPANT'] == df1_row['PARTICIPANT']) & (df2['TRIAL'] == df1_row['TRIAL'])] #Subset out matching data in df2 so we're not looping through the entirety of df2
for i, df2_row in df2_subset.iterrows(): #Iterate through df2 subset
dist = calculate_distance(df1_row,df2_row) #Calculate dist for each row
if dist <= 150:
fix_index_list.append(df2_row['CURRENT_FIX_INDEX']) #Add dist to list if greater or equal to 150
refix_values.append(','.join(map(str, fix_index_list))) #Add list of matching INDEX values to list of lists
df1['REFIX'] = refix_values #Convert list of lists to new column
def count_refix_items(row_refix_value):
integers = [int(s) for s in row_refix_value.split(',') if s.isdigit()] #Split string by commas and count resultant elements
return len(integers)
refix_per_trial = df1[df1['REFIX'].notna()].groupby(['TRIAL','PARTICIPANT'])['REFIX'].apply(lambda x: x.apply(count_refix_items).sum()).reset_index()
refix_per_trial = refix_per_trial.rename(columns={'REFIX': 'REFIXes/trial/participant'}) #Rename column
df1 = pd.merge(df1, refix_per_trial, on = ['TRIAL','PARTICIPANT'], how = 'outer')
#Write df to csv
df1.to_csv('REFIX_calc_output.tab', sep='\t', index=False) #Change name of file to whatever you want
The output I got using the input you provided:
IP_INDEX | IP_LABEL | PARTICIPANT | TRIAL | CURRENT_FIX_INDEX | PRE_FIX_X | PRE_FIX_Y | REFIX | REFIXes/trial/participant |
---|---|---|---|---|---|---|---|---|
1 | 1st 3 Seconds | a | 1 | 1 | 550 | 150 | 1 | 3 |
1 | 1st 3 Seconds | a | 1 | 2 | 600 | 300 | 1,2 | 3 |
1 | 1st 3 Seconds | a | 2 | 1 | 250 | 600 | 1 | 1 |
1 | 1st 3 Seconds | b | 1 | 1 | 400 | 400 | 0 | |
1 | 1st 3 Seconds | b | 2 | 1 | 600 | 400 | 1 | 1 |