Search code examples
pythonpandasdataframeperformance

Comparing 2 Pandas dataframes row by row and performing a calculation on each row


I have created 2 Pandas dataframes and I need to compare rows in DF1 to rows in DF2, by way of a 2-column distance function.

Basically I need every row in DF2 to be compared to row 1 in DF 1; every row in DF2 to be compared to row 2 in DF1, etc provided they have the same Participant number and same Trial number. If the resulting distance value is less than a specific threshold, I need to document what row that was from DF2 in a new column.

I'm very new to Pandas, so I'm not quite sure where to start with this, but I have detailed the step-by-step logic of my problem below.

Here are my 2 dataframes (samples - the actual dataframes will be extremely large; 10,000+ rows each):

DF1:

IP_INDEX IP_LABEL PARTICIPANT TRIAL CURRENT_FIX_INDEX PRE_FIX_X PRE_FIX_Y
1 1st 3 Seconds a 1 1 550 150
1 1st 3 Seconds a 1 2 600 300
1 1st 3 Seconds a 2 1 250 600
1 1st 3 Seconds b 1 1 400 400
1 1st 3 Seconds b 2 1 600 400

DF2:

IP_INDEX IP_LABEL PARTICIPANT TRIAL CURRENT_FIX_INDEX POST_FIX_X POST_FIX_Y
2 2nd 3 Seconds a 1 1 500 200
2 2nd 3 Seconds a 1 2 650 350
2 2nd 3 Seconds a 2 1 300 650
2 2nd 3 Seconds b 1 1 250 700
2 2nd 3 Seconds b 1 2 450 150
2 2nd 3 Seconds b 2 1 550 350
2 2nd 3 Seconds b 2 2 350 550

Here is what I need (example for any given row):

  1. Create a new column in DF1 called 'REFIX'. This will be blank for now, but cells will get filled in later based on the formula below.

  2. If DF2 'PARTICIPANT' == DF1 'PARTICIPANT and DF2 'TRIAL' == DF1 'TRIAL',

    1. Then calculate math.dist(p, q) where p and q are defined by the ***_FIX_X and ***_FIX_Y values from DF1 and DF2 (the *** stands for PRE or POST based on whether it's DF1 or DF2).
    2. Else (i.e. If DF2 'PARTICIPANT' != DF1 'PARTICIPANT' or DF2 'TRIAL' != DF1 'TRIAL'), then move on to the next row in DF2
  3. If result of math.dist(p, q) <= 150, then fill in 'REFIX' column in DF1 (for the row we're on in DF1) with the value of CURRENT_FIX_INDEX in DF2 (for the row that we're on in DF2)

  4. Save DF1 as a new csv file

I'm not sure if I'm correct here, but this is what I have for determining the p and q values for the math.dist(p, q) function:

# Define p and q values for distance calculation
p = [(df1['PRE_FIX_X']), (df1['PRE_FIX_Y'])]
q = [(df2['POST_FIX_X']), (df2['POST_FIX_Y'])]

Based on the example dataframes given, a complete DF1 file would look like this:

IP_INDEX IP_LABEL PARTICIPANT TRIAL CURRENT_FIX_INDEX PRE_FIX_X PRE_FIX_Y REFIX
1 1st 3 Seconds a 1 1 550 150 1
1 1st 3 Seconds a 1 2 600 300 1,2
1 1st 3 Seconds a 2 1 250 600 1
1 1st 3 Seconds b 1 1 400 400
1 1st 3 Seconds b 2 1 600 400 1

Solution

  • EDITED: Now includes count of REFIXes per trial as requested in comments

    Your problem appears quite complex and was a smidge hard to follow but I think this may be what you want?

    I don't think iterating through the two df's with nested if statements is the most efficient route but I can't think of a better option right now :)

    Part 1: Loading in the data you provided: (You can obviously ignore this bit, but putting it in for easy reproducibility/testing:

    import pandas as pd
    
    # Define the data for DF1
    data_df1 = {
        'IP_INDEX': [1, 1, 1, 1, 1],
        'IP_LABEL': ['1st 3 Seconds', '1st 3 Seconds', '1st 3 Seconds', '1st 3 Seconds', '1st 3 Seconds'],
        'PARTICIPANT': ['a', 'a', 'a', 'b', 'b'],
        'TRIAL': [1, 1, 2, 1, 2],
        'CURRENT_FIX_INDEX': [1, 2, 1, 1, 1],
        'PRE_FIX_X': [550, 600, 250, 400, 600],
        'PRE_FIX_Y': [150, 300, 600, 400, 400]
    }
    
    # Create DF1
    df1 = pd.DataFrame(data_df1)
    
    data_df2 = {
        'IP_INDEX': [2, 2, 2, 2, 2, 2, 2],
        'IP_LABEL': ['2nd 3 Seconds', '2nd 3 Seconds', '2nd 3 Seconds', '2nd 3 Seconds', '2nd 3 Seconds', '2nd 3 Seconds', '2nd 3 Seconds'],
        'PARTICIPANT': ['a', 'a', 'a', 'b', 'b', 'b', 'b'],
        'TRIAL': [1, 1, 2, 1, 1, 2, 2],
        'CURRENT_FIX_INDEX': [1, 2, 1, 1, 2, 1, 2],
        'POST_FIX_X': [500, 650, 300, 250, 450, 550, 350],
        'POST_FIX_Y': [200, 350, 650, 700, 150, 350, 550]
    }
    
    # Create DF2
    df2 = pd.DataFrame(data_df2)
    

    Part 2: Doing the calculations:

    import math
    
    #Create function to calculate distance between coordinates
    def calculate_distance(df1_row,df2_row):
        p = (df1_row['PRE_FIX_X'], df1_row['PRE_FIX_Y'])
        q = (df2_row['POST_FIX_X'], df2_row['POST_FIX_Y'])
        return math.dist(p, q)
    
    refix_values = [] #Create empty list to hold lists of matching FIX_INDEX values in df2
    
    for index, df1_row in df1.iterrows(): #Iterate through each row in df1
        fix_index_list = [] #Create empty list to hold matching INDEX values for loop row
        df2_subset = df2.loc[(df2['PARTICIPANT'] == df1_row['PARTICIPANT']) & (df2['TRIAL'] == df1_row['TRIAL'])] #Subset out matching data in df2 so we're not looping through the entirety of df2
        for i, df2_row in df2_subset.iterrows(): #Iterate through df2 subset
            dist = calculate_distance(df1_row,df2_row) #Calculate dist for each row
            if dist <= 150:
                fix_index_list.append(df2_row['CURRENT_FIX_INDEX']) #Add dist to list if greater or equal to 150
        refix_values.append(','.join(map(str, fix_index_list))) #Add list of matching INDEX values to list of lists
    
    df1['REFIX'] = refix_values #Convert list of lists to new column
    
    def count_refix_items(row_refix_value):
        integers = [int(s) for s in row_refix_value.split(',') if s.isdigit()] #Split string by commas and count resultant elements
        return len(integers)
    
    refix_per_trial = df1[df1['REFIX'].notna()].groupby(['TRIAL','PARTICIPANT'])['REFIX'].apply(lambda x: x.apply(count_refix_items).sum()).reset_index()
    refix_per_trial = refix_per_trial.rename(columns={'REFIX': 'REFIXes/trial/participant'}) #Rename column
    
    df1 = pd.merge(df1, refix_per_trial, on = ['TRIAL','PARTICIPANT'], how = 'outer')
    
    #Write df to csv
    df1.to_csv('REFIX_calc_output.tab', sep='\t', index=False) #Change name of file to whatever you want
    

    The output I got using the input you provided:

    IP_INDEX IP_LABEL PARTICIPANT TRIAL CURRENT_FIX_INDEX PRE_FIX_X PRE_FIX_Y REFIX REFIXes/trial/participant
    1 1st 3 Seconds a 1 1 550 150 1 3
    1 1st 3 Seconds a 1 2 600 300 1,2 3
    1 1st 3 Seconds a 2 1 250 600 1 1
    1 1st 3 Seconds b 1 1 400 400 0
    1 1st 3 Seconds b 2 1 600 400 1 1