I have a nested loop that has to loop through a huge amount of data.
Assume a data frame of random values with 1,000,000 rows, each of which has an X,Y location in 2D space. A window of length 10 goes through all 1M rows one by one until all the calculations are done.
Explaining what the code is supposed to do:
`r_test` contains the diameters of the different circles of investigation in our 2D plane (X-Y plane).
For each value in `r_test`, we compare the distance between every point and the remaining 9 points, and if the distance is less than that diameter we add 2 to `H`. Then we calculate `H/(N**2)` and store it in `c_10` at the index corresponding to that diameter of investigation.
After fitting a line to log10(`c_10`) against log10(`r_test`), we read the slope of the fitted line and save it to `S_wind[ii]`. So the first 9 data points will have no value calculated for them, and they keep `np.inf` so they can be distinguished later.
The window then moves on by one row, and this repeats until `S_wind` is completed.
What's a potentially better algorithm for this than the one I'm using, in Python 3.x?
Many thanks in advance!
import numpy as np
import pandas as pd
####generating input data frame
df = pd.DataFrame(data = np.random.randint(2000, 6000, (1000000, 2)))
df.columns= ['X','Y']
####====creating upper and lower bound for the diameter of the investigation circles
x_range =max(df['X']) - min(df['X'])
y_range = max(df['Y']) - min(df['Y'])
R = max(x_range,y_range)/20
d = 2
N = 10 #### Number of points in each window
#r1 = 2*R*(1/N)**(1/d)
#r2 = (R)/(1+d)
#r_test = np.arange(r1, r2, 0.05)
##===avoiding generation of empty r_test
r1 = 80
r2= 800
r_test = np.arange(r1, r2, 5)
S_wind = np.zeros(len(df['X'])) + np.inf
for ii in range(10, len(df['X'])): #### maybe the code runs slower because of using len() function instead of a number
    c_10 = np.zeros(len(r_test)) + np.inf
    H = 0
    C = 0
    N = 10 ##### maybe I should also remove this
    for ind in range(len(r_test)):
        for i in range(ii-10, ii):
            for j in range(ii-10, ii):
                dd = r_test[ind] - np.sqrt((df['X'][i] - df['X'][j])**2 + (df['Y'][i] - df['Y'][j])**2)
                if dd > 0:
                    H += 2
        c_10[ind] = (H/(N**2))
    S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0]
You can use `numpy` broadcasting to eliminate all of the inner loops. I'm not sure if there's an easy way to get rid of the outermost loop, but the others are not too hard to avoid.
The inner loops compare ten 2D points against each other in pairs. That's just crying out for a 10x10x2 `numpy` array:
# replacing the `for ind` loop and its contents:
points = np.hstack((np.asarray(df['X'])[ii-10:ii, None], np.asarray(df['Y'])[ii-10:ii, None]))
differences = np.subtract(points[None, :, :], points[:, None, :]) # broadcast to 10x10x2
squared_distances = (differences * differences).sum(axis=2)
within_range = squared_distances[None,:,:] < (r_test*r_test)[:, None, None] # compare squares
c_10 = within_range.sum(axis=(1,2)).cumsum() * 2 / (N**2)
S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0] # this is unchanged...
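If you want to convince yourself that the broadcast version reproduces the loops, a small side-by-side check like the one below may help. It is only a sketch: `slope_loops` and `slope_broadcast` are names made up for this example, the data is a 30-row random array rather than the 1M-row frame, and `r_test` uses a coarser step so the pure-Python loops finish quickly. The loop version follows the description in the question (2 added to `H` per hit). A constant factor in `c_10` only shifts the intercept of the log-log fit, so the slope would agree even with a different increment.

```python
import numpy as np

def slope_loops(xy, r_test, ii, N=10):
    """Reference: nested loops over the window of rows ii-N .. ii-1."""
    H = 0  # deliberately not reset per radius, as in the question
    c_10 = np.empty(len(r_test))
    for ind in range(len(r_test)):
        for i in range(ii - N, ii):
            for j in range(ii - N, ii):
                dist = np.sqrt((xy[i, 0] - xy[j, 0])**2 + (xy[i, 1] - xy[j, 1])**2)
                if r_test[ind] - dist > 0:
                    H += 2
        c_10[ind] = H / N**2
    return np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0]

def slope_broadcast(xy, r_test, ii, N=10):
    """Same window, with the three inner loops replaced by broadcasting."""
    points = xy[ii - N:ii]                                  # (N, 2)
    differences = points[None, :, :] - points[:, None, :]   # (N, N, 2)
    squared_distances = (differences * differences).sum(axis=2)
    within_range = squared_distances[None, :, :] < (r_test * r_test)[:, None, None]
    c_10 = within_range.sum(axis=(1, 2)).cumsum() * 2 / N**2
    return np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0]

rng = np.random.default_rng(0)
xy = rng.integers(2000, 6000, size=(30, 2))   # small stand-in for the data frame
r_test = np.arange(80, 800, 40)               # coarser than the question's step of 5

loops = np.array([slope_loops(xy, r_test, ii) for ii in range(10, len(xy))])
fast = np.array([slope_broadcast(xy, r_test, ii) for ii in range(10, len(xy))])
print(np.allclose(loops, fast))
```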
I'm not very `pandas` savvy, so there's probably a better way to get the X and Y values into a single 2-dimensional numpy array. You generated the random data in the format that I'd find most useful, then converted it into something less immediately useful for numeric operations!
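One option that should work here, assuming a reasonably recent pandas (`DataFrame.to_numpy` was added in 0.24), is to select both columns and convert once, outside the loop. The variable name `xy` is just for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(2000, 6000, size=(1000, 2)), columns=['X', 'Y'])

# Convert once, before the loop, instead of calling np.asarray on each column per window.
xy = df[['X', 'Y']].to_numpy()   # shape (n_rows, 2)

ii = 10
points = xy[ii - 10:ii]          # shape (10, 2), the same values as the hstack version
```

Slicing `xy` gives a view rather than a copy, so taking the window on each iteration is cheap.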
Note that this code matches the output of your loop code. I'm not sure that's actually doing what you want it to do, as there are several slightly strange things in your current code. For example, you may not want the `cumsum` in my code, which corresponds to only re-initializing `H` to zero in the outermost loop. If you don't want the matches for smaller values of `r_test` to be counted again for the larger values, you can skip that sum (or equivalently, move the `H = 0` line to in between the `for ind` and the `for i` loops in your original code).
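As for the outermost loop, one possible way to remove most of it is to build all the windows at once with `numpy.lib.stride_tricks.sliding_window_view` (NumPy 1.20 or later) and process them in chunks. Treat the following as an untested-at-scale sketch rather than a drop-in replacement: `window_slopes`, `cumulative` and `chunk` are names invented for this example, and it swaps `np.polyfit` for the closed-form least-squares slope so that every window can be fitted in a single matrix product. The `cumulative` flag switches between keeping and skipping the `cumsum` discussed above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def window_slopes(xy, r_test, N=10, cumulative=True, chunk=2000):
    """Log-log slope for every window of N consecutive rows (hypothetical helper).

    out[ii] uses rows ii-N .. ii-1, like S_wind[ii]; the first N entries stay np.inf.
    """
    xy = np.asarray(xy, dtype=float)
    r = np.asarray(r_test, dtype=float)
    r_sq = r * r
    log_r = np.log10(r)
    n = len(xy)
    out = np.full(n, np.inf)
    if n <= N:
        return out

    # View of shape (n-N+1, N, 2); drop the last window, which would belong to ii == n.
    windows = sliding_window_view(xy, N, axis=0).transpose(0, 2, 1)[:n - N]

    # Closed-form least-squares slope: sum((x - mean(x)) * y) / sum((x - mean(x))**2).
    x_centered = log_r - log_r.mean()
    denom = (x_centered ** 2).sum()

    for start in range(0, n - N, chunk):   # chunking bounds the temporary arrays
        w = windows[start:start + chunk]                            # (m, N, 2)
        diff = w[:, :, None, :] - w[:, None, :, :]                  # (m, N, N, 2)
        d_sq = (diff * diff).sum(axis=-1)                           # (m, N, N)
        within = d_sq[:, None, :, :] < r_sq[None, :, None, None]    # (m, R, N, N)
        counts = within.sum(axis=(2, 3))                            # (m, R)
        if cumulative:
            counts = counts.cumsum(axis=1)   # same effect as never resetting H
        log_c = np.log10(counts * 2 / N**2)
        out[N + start:N + start + len(w)] = (log_c @ x_centered) / denom
    return out


rng = np.random.default_rng(0)
xy = rng.integers(2000, 6000, size=(5000, 2))
r_test = np.arange(80, 800, 5)

S_wind = window_slopes(xy, r_test)                          # keeps the cumsum
S_wind_plain = window_slopes(xy, r_test, cumulative=False)  # counts reset per radius
print(S_wind[:12])
```

The boolean array of shape `(m, R, N, N)` is the largest temporary, so `chunk` trades memory for fewer Python-level iterations. With 144 radii and 10-point windows, a chunk of 2000 needs roughly 29 MB for that array.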