I am currently working on a project where I need to loop through millions of row combinations and do a calculation in each iteration. Below is the code; df1 is a dataframe containing 40,000 rows and df2 is a dataframe containing 60,000 rows.
import pandas as pd

mycolumns = ['RowValue_df1', 'RowValue2_df1', 'RowValue_df2', 'RowValue2_df2']
combination = pd.DataFrame(columns=mycolumns)
custom_val = 50

for i in range(0, len(df1)):
    RowValue_df1 = df1["Column1"][i]
    RowValue2_df1 = df1["Column2"][i]
    for m in range(0, len(df2)):
        RowValue_df2 = df2["Column1"][m]
        RowValue2_df2 = df2["Column2"][m]
        calc_val = ((RowValue_df1 / RowValue_df2) * (RowValue2_df2 / RowValue_df1)) * 100  # just an example
        if calc_val <= custom_val:
            combination = combination.append(pd.Series([
                RowValue_df1,
                RowValue2_df1,
                RowValue_df2,
                RowValue2_df2,
            ], index=mycolumns), ignore_index=True)
It takes so much time that I was not even able to run it to completion. Is there a more efficient way to write the above code?
Two changes can be applied to speed up this code:

- combination.append is very slow because it recreates a new dataframe for each appended row. You can append the rows to a Python list and then create the final dataframe from the resulting list. This should be much, much faster with a list (see the first sketch below).
- The m-based inner loop can be vectorized using Numpy. You can compute calc_val by working directly on whole columns rather than on individual values, and you can use Numpy's where to filter the matching elements (see the second sketch after this list).
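
A minimal sketch of the first change, assuming df1 and df2 are the dataframes from your question (with the same Column1/Column2 layout): the matching rows are collected in a plain Python list, and the dataframe is built once at the end.

import pandas as pd

mycolumns = ['RowValue_df1', 'RowValue2_df1', 'RowValue_df2', 'RowValue2_df2']
custom_val = 50

rows = []  # appending to a plain Python list is cheap
for i in range(len(df1)):
    RowValue_df1 = df1["Column1"][i]
    RowValue2_df1 = df1["Column2"][i]
    for m in range(len(df2)):
        RowValue_df2 = df2["Column1"][m]
        RowValue2_df2 = df2["Column2"][m]
        calc_val = (RowValue_df1 / RowValue_df2) * (RowValue2_df2 / RowValue_df1) * 100
        if calc_val <= custom_val:
            rows.append((RowValue_df1, RowValue2_df1, RowValue_df2, RowValue2_df2))

# The dataframe is created once here instead of being recreated on every append
combination = pd.DataFrame(rows, columns=mycolumns)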
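
And a minimal sketch of the second change, under the same assumptions about df1 and df2: the inner m-loop is replaced by whole-column Numpy operations, and numpy.where picks out the indices of the df2 rows that satisfy the condition. It keeps the list accumulation from the first sketch.

import numpy as np
import pandas as pd

mycolumns = ['RowValue_df1', 'RowValue2_df1', 'RowValue_df2', 'RowValue2_df2']
custom_val = 50

# Extract the df2 columns once as Numpy arrays
col1_df2 = df2["Column1"].to_numpy()
col2_df2 = df2["Column2"].to_numpy()

rows = []
for RowValue_df1, RowValue2_df1 in zip(df1["Column1"].to_numpy(), df1["Column2"].to_numpy()):
    # One vectorized computation over all rows of df2 at once
    calc_val = (RowValue_df1 / col1_df2) * (col2_df2 / RowValue_df1) * 100
    # Indices of the df2 rows that pass the filter
    matches = np.where(calc_val <= custom_val)[0]
    for m in matches:
        rows.append((RowValue_df1, RowValue2_df1, col1_df2[m], col2_df2[m]))

combination = pd.DataFrame(rows, columns=mycolumns)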