Tags: python-3.x, for-loop, nested-loops

What's a potentially better algorithm to solve this python nested for loop than the one I'm using?


I have a nested loop that has to run over a very large amount of data.

Assume a data frame of random values with 1,000,000 rows, where each row is an X,Y location in 2D space. A window of length 10 slides over all 1M rows, one row at a time, until all the calculations are done.

Explaining what the code is supposed to do:

  • Each row represents a coordinate pair in the X-Y plane.
  • r_test contains the diameters of the different circles of investigation in our 2D plane (the X-Y plane).
  • For each window of 10 points/rows, and for every diameter in r_test, we compare the distance between every point and the remaining 9 points; if the distance is less than that diameter, we add 2 to H. Then we calculate H/(N**2) and store it in c_10 at the index corresponding to that diameter of investigation.
  • For the first 10 points, once the loop has gone through all the diameters in r_test, we read the slope of the line fitted to log10(c_10) against log10(r_test) and save it to S_wind[ii]. The first 10 entries of S_wind (indices 0 to 9) therefore have no calculated value, so they keep np.inf to be distinguished later.
  • Then the window moves one point down the rows and repeats this process until S_wind is complete.

What's a potentially better algorithm to solve this than the one I'm using, in Python 3.x?

Many thanks in advance!

import numpy as np
import pandas as pd
####generating input data frame
df = pd.DataFrame(data = np.random.randint(2000, 6000, (1000000, 2)))
df.columns= ['X','Y']


####====creating upper and lower bound for the diameter of the investigation circles    
x_range =max(df['X']) - min(df['X']) 
y_range = max(df['Y']) - min(df['Y'])
R = max(x_range,y_range)/20
d = 2
N = 10 #### Number of points in each window
#r1 = 2*R*(1/N)**(1/d)  
#r2 = (R)/(1+d)
#r_test = np.arange(r1, r2, 0.05)
##===avoiding generation of empty r_test
r1 = 80
r2= 800  
r_test = np.arange(r1, r2, 5) 

S_wind = np.zeros(len(df['X'])) + np.inf

for ii in range (10,len(df['X'])): #### maybe the code run slower because of using len() function instead of a number
        c_10 = np.zeros(len(r_test)) +np.inf
        H = 0
        C = 0
        N = 10 ##### maybe I should also remove this
        for ind in range(len(r_test)):
            for i in range (ii-10,ii):
                for j in range(ii-10,ii):
                    dd = r_test[ind] - np.sqrt((df['X'][i] - df['X'][j])**2+ (df['Y'][i] - df['Y'][j])**2)
                    if dd > 0:
                        H += 2
            c_10[ind] = (H/(N**2))

        S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0]   

Solution

  • You can use numpy broadcasting to eliminate all of the inner loops. I'm not sure if there's an easy way to get rid of the outermost loop, but the others are not too hard to avoid.

    The inner loops compare ten 2D points against each other in pairs. That's just crying out for a 10x10x2 numpy array:

    # replacing the `for ind` loop and its contents:
    points = np.hstack((np.asarray(df['X'])[ii-10:ii, None], np.asarray(df['Y'])[ii-10:ii, None]))
    differences = np.subtract(points[None, :, :],  points[:, None, :]) # broadcast to 10x10x2
    squared_distances = (differences * differences).sum(axis=2)
    within_range = squared_distances[None,:,:] < (r_test*r_test)[:, None, None]  # compare squares
    c_10 = within_range.sum(axis=(1,2)).cumsum() * 2 / (N**2)
    
    S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0] # this is unchanged...
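
As a sanity check on the snippet above, here is a minimal sketch that compares the raw pair counts (before any scaling by N) from a plain triple loop with the counts from the broadcast version, for a single window. The data are made up with a fixed seed, and the helper names `loop_pair_counts` and `broadcast_pair_counts` are introduced here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)                       # fixed seed, illustrative data only
points = rng.integers(2000, 6000, size=(10, 2)).astype(float)   # one window of 10 points
r_test = np.arange(80, 800, 5)

def loop_pair_counts(points, r_test):
    """Running pair count H from nested loops; H is never reset between radii."""
    counts = np.empty(len(r_test))
    H = 0
    for ind, r in enumerate(r_test):
        for i in range(len(points)):
            for j in range(len(points)):
                dist = np.sqrt(((points[i] - points[j]) ** 2).sum())
                if r - dist > 0:
                    H += 1
        counts[ind] = H
    return counts

def broadcast_pair_counts(points, r_test):
    """Same running count, using a 10x10x2 array of pairwise differences."""
    differences = points[None, :, :] - points[:, None, :]
    squared_distances = (differences * differences).sum(axis=2)
    within_range = squared_distances[None, :, :] < (r_test * r_test)[:, None, None]
    return within_range.sum(axis=(1, 2)).cumsum()

loop_counts = loop_pair_counts(points, r_test)
fast_counts = broadcast_pair_counts(points, r_test)
print(np.array_equal(loop_counts, fast_counts))
```

Any constant factor applied to the counts afterwards only shifts log10(c_10) by a constant, so it changes the intercept of the fitted line but not the slope stored in S_wind.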
    

    I'm not very pandas savvy, so there's probably a better way to get the X and Y values into a single 2-dimensional numpy array. You generated the random data in the format that I'd find most useful, then converted into something less immediately useful for numeric operations!
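
One possible way to get both columns into a single 2D array is `DataFrame.to_numpy()` on the two selected columns, done once before the loop. The sketch below uses a smaller made-up frame than the question's; the names `xy` and `hstacked` are only for illustration.

```python
import numpy as np
import pandas as pd

# Smaller stand-in for the 1M-row frame from the question
df = pd.DataFrame(np.random.randint(2000, 6000, (1000, 2)), columns=['X', 'Y'])

# Convert once, outside the loop: shape (n_rows, 2)
xy = df[['X', 'Y']].to_numpy()

ii = 10
points = xy[ii - 10:ii]          # 10x2 window for this value of ii

# The hstack expression from the snippet above, for comparison
hstacked = np.hstack((np.asarray(df['X'])[ii-10:ii, None],
                      np.asarray(df['Y'])[ii-10:ii, None]))
print(np.array_equal(points, hstacked))
```

Slicing `xy` inside the loop is then a cheap view, instead of two column lookups and an `hstack` on every iteration.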

    Note that this code matches the output of your loop code. I'm not sure that's actually doing what you want it to do, as there are several slightly strange things in your current code. For example, you may not want the cumsum in my code, which corresponds to only re-initializing H to zero in the outermost loop. If you don't want the matches for smaller values of r_test to be counted again for the larger values, you can skip the cumsum (or equivalently, move the H = 0 line to between the for ind and the for i loops in your original code).
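
If the outermost loop is still too slow, one option is to batch many windows at once. The following is only a sketch under some assumptions: it needs NumPy 1.20 or later for `sliding_window_view`, it relies on `np.polyfit` accepting several data sets as columns of a 2D `y`, and the function name `window_slopes` and its `cumulative` and `chunk` parameters are invented here. The `cumulative` flag switches between the running count discussed above and a per-radius count.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

def window_slopes(xy, r_test, window=10, cumulative=True, chunk=1000):
    """Log-log slope of c_10 against r_test for every window of rows in xy.

    Entry ii uses rows ii-window .. ii-1; the first `window` entries stay np.inf.
    """
    xy = np.asarray(xy, dtype=float)
    r_test = np.asarray(r_test, dtype=float)
    slopes = np.full(len(xy), np.inf)
    if len(xy) <= window:
        return slopes

    r_sq = r_test ** 2
    log_r = np.log10(r_test)

    # windows[k] holds rows k .. k+window-1, shape (n_windows, window, 2)
    windows = sliding_window_view(xy, window, axis=0).transpose(0, 2, 1)
    n_targets = len(xy) - window          # the loop fills ii = window .. len(xy)-1

    # Work in chunks so the boolean array below stays a manageable size
    for start in range(0, n_targets, chunk):
        stop = min(start + chunk, n_targets)
        w = windows[start:stop]                                   # (m, window, 2)
        diff = w[:, :, None, :] - w[:, None, :, :]                # (m, window, window, 2)
        sq_dist = (diff * diff).sum(axis=3)                       # (m, window, window)
        within = sq_dist[:, None, :, :] < r_sq[None, :, None, None]
        counts = within.sum(axis=(2, 3))                          # (m, n_radii)
        if cumulative:
            counts = counts.cumsum(axis=1)    # H carried over from radius to radius
        c = counts * 2 / window**2            # the constant factor does not affect the slope
        # One fit per window: polyfit treats each column of y as its own data set
        slopes[start + window:stop + window] = np.polyfit(log_r, np.log10(c).T, 1)[0]
    return slopes

# Small illustrative run on made-up data
rng = np.random.default_rng(1)
demo_xy = rng.integers(2000, 6000, size=(200, 2))
demo_r = np.arange(80, 800, 5)
S_cumulative = window_slopes(demo_xy, demo_r)
S_per_radius = window_slopes(demo_xy, demo_r, cumulative=False)
```

The chunk size trades memory for speed: each chunk builds a boolean array of shape (chunk, len(r_test), window, window), so with the question's 144 radii a chunk of 1000 windows needs roughly 14 MB for that array.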