python, pandas, performance, vectorization, pairwise-distance

Vectorize nested loops for pairwise distance calculation


How can I make the script below more efficient? This is a follow-up to my previous post, Python nested loop issue.
It currently takes the best part of two hours to process input tables of about 15000 and 1500 rows. Manually processing my data in Excel takes me an order of magnitude less time - not ideal!

I understand that iterrows is a bad approach to the problem, and that vectorisation is the way forward, but I am a bit dumbfounded as to how it would work with regard to the second for loop.

The following script extract takes two dataframes:

  • qinsy_file_2
  • segy_vlookup (ignore the naming on that one).

For every row in qinsy_file_2, it iterates through segy_vlookup to calculate the distance between the coordinates in each file. If this distance is less than a pre-set value (here named buffer), the row gets transcribed to a new dataframe out_df (otherwise the row is passed over).

# Loop through Qinsy file
for index_qinsy, row_qinsy in qinsy_file_2.iterrows():
    # Loop through SEGY navigation
    for index_segy, row_segy in segy_vlookup.iterrows():
        # Calculate distance between points
        if ((((segy_vlookup["CDP_X"][index_segy] - qinsy_file_2["CMP Easting"][index_qinsy])**2) +
             ((segy_vlookup["CDP_Y"][index_segy] - qinsy_file_2["CMP Northing"][index_qinsy])**2))**0.5) <= buffer:
            # Append rows less than the distance modifier to the new dataframe
            out_df = pd.concat([out_df, row_qinsy])
            break
        else:
            pass

So far I have read through the following:


Solution

  • Extending from the loop vectorization you have already looked into, try searching for the keywords pairwise distance calculation, e.g.: How to efficiently calculate euclidean distance matrix for several timeseries

    The method below is fully vectorized (no for loops at all). It drops down to NumPy for the intensive 2D part, then returns a pandas dataframe as requested. With your input sizes of 1500 * 15000 rows, calculating the distances and filtering rows by <= buffer completes in a fraction of a second.


    Mock-up data
    Let

    • dfA represent qinsy_file_2 with nA = 15000 points of 2D coordinates ('xA', 'yA')
    • dfB represent segy_vlookup with nB = 1500 points of 2D coordinates ('xB', 'yB')
    import numpy as np
    import pandas as pd
    
    buffer = 0.01  # arbitrary value
    nA, nB = 15000, 1500  # sizes stated above
    
    dfA = pd.DataFrame({'xA' : np.random.default_rng(seed=0).random(nA), 
                        'yA' : np.random.default_rng(seed=1).random(nA)})
    dfB = pd.DataFrame({'xB' : np.random.default_rng(seed=3).random(nB), 
                        'yB' : np.random.default_rng(seed=4).random(nB)})
    

    With these seeds, if you want to reproduce the process, the dataframes look like this:

                 xA        yA
    0      0.636962  0.511822
    1      0.269787  0.950464
    2      0.040974  0.144160
    3      0.016528  0.948649
    4      0.813270  0.311831
    ...         ...       ...
    14995  0.457170  0.507611
    14996  0.283829  0.828005
    14997  0.679110  0.910104
    14998  0.273703  0.294932
    14999  0.097813  0.366295
    
    [15000 rows x 2 columns]
    
                xB        yB
    0     0.085649  0.943056
    1     0.236811  0.511328
    2     0.801274  0.976244
    3     0.582162  0.080836
    4     0.094129  0.607356
    ...        ...       ...
    1495  0.684086  0.821719
    1496  0.383240  0.957020
    1497  0.341389  0.064735
    1498  0.032523  0.234434
    1499  0.500383  0.931106
    
    [1500 rows x 2 columns]
    

    One-liner summary

    df_out = dfA[np.sum((np.sqrt((np.atleast_2d(dfA['xA'].values) - 
                                  np.atleast_2d(dfB['xB'].values).T) **2 +
                                 (np.atleast_2d(dfA['yA'].values) - 
                                  np.atleast_2d(dfB['yB'].values).T) **2))<=buffer, axis=0)>0]
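
    If SciPy is available, the same selection can also be written with scipy.spatial.distance.cdist, which some may find more readable - a sketch, assuming the mock-up column names above:

    from scipy.spatial.distance import cdist
    
    # D[i, j] = Euclidean distance between dfB point i and dfA point j, shape (nB, nA)
    D = cdist(dfB[['xB', 'yB']].values, dfA[['xA', 'yA']].values)
    df_out = dfA[(D <= buffer).any(axis=0)]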
    

    Detailed answer

    Each point from dfA is tested for distance <= buffer against each point from dfB, to determine whether the corresponding row of dfA should be selected into df_out.

    Distance matrix
    Every column in the output below stands for a row in dfA:

    D = np.sqrt((np.atleast_2d(dfA['xA'].values) - 
                 np.atleast_2d(dfB['xB'].values).T) **2 +
                (np.atleast_2d(dfA['yA'].values) - 
                 np.atleast_2d(dfB['yB'].values).T) **2)
    D
    [[0.69993476 0.18428648 0.80014469 ... 0.59437473 0.67485498 0.57688898]
     [0.40015149 0.44037255 0.41613029 ... 0.59552573 0.21951799 0.20088488]
     [0.49263227 0.53211262 1.12712974 ... 0.13891991 0.86169468 0.93107222]
     ...
     [0.535957   0.88861864 0.3107378  ... 0.91033173 0.2399422  0.38764484]
     [0.66504838 0.75431549 0.0906694  ... 0.93520197 0.24865128 0.14713943]
     [0.44096846 0.23140718 0.91123077 ... 0.17995671 0.6753528  0.69359482]]
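
    For reference, the array shapes involved in this broadcast (with nA = 15000 and nB = 1500) are:

    np.atleast_2d(dfA['xA'].values).shape      # (1, 15000): dfA coordinates as a row vector
    np.atleast_2d(dfB['xB'].values).T.shape    # (1500, 1): dfB coordinates as a column vector
    D.shape                                    # (1500, 15000): one row per dfB point, one column per dfA point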
    

    Now filter by threshold (distance <= buffer)
    At least one value per column of D meeting the threshold is enough to select that column for df_out, i.e. that row of dfA.

    b = np.sum(D<=buffer, axis=0)>0 # boolean selector
    [ True False False ... False  True  True] # In this example, at least the first,
    # last and second-to-last rows will be selected, and probably many more in between.
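
    An equivalent, perhaps more readable, selector uses any instead of counting matches:

    b = (D <= buffer).any(axis=0)  # True wherever at least one dfB point lies within buffer of the dfA point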
    

    Finally, apply the row-wise selection to dfA:

    df_out = dfA[b]
                 xA        yA
    0      0.636962  0.511822
    4      0.813270  0.311831
    9      0.935072  0.027559
    10     0.815854  0.753513
    11     0.002739  0.538143
    ...         ...       ...
    14988  0.039833  0.239034
    14994  0.243440  0.428287
    14996  0.283829  0.828005
    14998  0.273703  0.294932
    14999  0.097813  0.366295
    
    [5620 rows x 2 columns]
    

    In this mock-up example, 5620 rows out of 15000 made it to the selection.
    Naturally any additional columns that your actual dfA might have will also be transcribed into df_out.

    To go further (although this already represents a dramatic improvement over the nested loops): could the number of distances actually calculated be reduced?

    • There is no symmetry to exploit here to reduce the calculations, unlike typical pairwise distance algorithms (which usually compare a dataset against itself); all nB * nA distances have to be computed.
    • I wonder, though, whether it would be advantageous to work in chunks, calculating the next chunk for a point only if the condition has not already been satisfied - one alternative that avoids computing every distance is sketched below.
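
    One concrete way to avoid materialising the full nB * nA distance matrix is a spatial index rather than chunking; below is a minimal sketch using scipy.spatial.cKDTree, assuming SciPy is available and the mock-up column names above:

    from scipy.spatial import cKDTree
    
    # Index the (fewer) dfB points, then query the nearest dfB neighbour of every dfA point.
    tree = cKDTree(dfB[['xB', 'yB']].values)
    dist, _ = tree.query(dfA[['xA', 'yA']].values, k=1)  # distance to the nearest dfB point
    df_out = dfA[dist <= buffer]  # keep dfA rows whose nearest dfB point lies within buffer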