Tags: python, performance, loops, memory-efficient

Efficient and faster way to do the calculation below without excessive memory usage in Python


I am currently working on a project where I need to loop through millions of rows and perform a calculation in each iteration. Below is the code; df1 is a dataframe with 40,000 rows and df2 is a dataframe with 60,000 rows.

import pandas as pd

mycolumns = ['RowValue_df1', 'RowValue2_df1', 'RowValue_df2', 'RowValue2_df2']
combination = pd.DataFrame(columns=mycolumns)
custom_val = 50
for i in range(len(df1)):
    RowValue_df1 = df1["Column1"][i]
    RowValue2_df1 = df1["Column2"][i]

    for m in range(len(df2)):
        RowValue_df2 = df2["Column1"][m]
        RowValue2_df2 = df2["Column2"][m]

        calc_val = ((RowValue_df1 / RowValue_df2) * (RowValue2_df2 / RowValue_df1)) * 100  # just an example

        if calc_val <= custom_val:
            # Note: DataFrame.append rebuilds the whole dataframe on every
            # call (and was removed in pandas 2.0).
            combination = combination.append(pd.Series([
                RowValue_df1,
                RowValue2_df1,
                RowValue_df2,
                RowValue2_df2,
                ], index=mycolumns), ignore_index=True)

It takes so much time that I was never able to run it to completion. Is there a more efficient way to write the above code?


Solution

  • Two changes can be applied to speed up this code:

    • combination.append is very slow because it recreates a new dataframe for every appended line. Append the rows to a plain Python list instead and build the final dataframe from that list once at the end; this should be much faster (see the first sketch below).
    • The inner m-based loop can be vectorized using NumPy: compute calc_val directly on whole columns rather than on individual values, and use NumPy's where to filter the matching elements (see the second sketch below).
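
Here is a minimal sketch of the first change. The loop structure from the question is unchanged, but matches are collected in a plain Python list; small made-up frames stand in for the question's 40,000- and 60,000-row df1 and df2 so the snippet is self-contained:

import pandas as pd

# Hypothetical sample data standing in for the question's df1/df2.
df1 = pd.DataFrame({"Column1": [10.0, 20.0, 30.0], "Column2": [1.0, 2.0, 3.0]})
df2 = pd.DataFrame({"Column1": [5.0, 50.0], "Column2": [2.0, 4.0]})

mycolumns = ['RowValue_df1', 'RowValue2_df1', 'RowValue_df2', 'RowValue2_df2']
custom_val = 50

rows = []  # appending to a list is O(1); no dataframe is rebuilt per match
for i in range(len(df1)):
    RowValue_df1 = df1["Column1"][i]
    RowValue2_df1 = df1["Column2"][i]
    for m in range(len(df2)):
        RowValue_df2 = df2["Column1"][m]
        RowValue2_df2 = df2["Column2"][m]
        calc_val = (RowValue_df1 / RowValue_df2) * (RowValue2_df2 / RowValue_df1) * 100
        if calc_val <= custom_val:
            rows.append((RowValue_df1, RowValue2_df1, RowValue_df2, RowValue2_df2))

# Build the dataframe once from the accumulated list.
combination = pd.DataFrame(rows, columns=mycolumns)

Building the dataframe a single time at the end avoids the repeated copying that per-row append causes.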
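And a sketch of the second change, using the same made-up data: the entire inner m-loop becomes one vectorized expression over df2's columns, with np.where producing the indices of the rows that pass the filter:

import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the question's df1/df2.
df1 = pd.DataFrame({"Column1": [10.0, 20.0, 30.0], "Column2": [1.0, 2.0, 3.0]})
df2 = pd.DataFrame({"Column1": [5.0, 50.0], "Column2": [2.0, 4.0]})

mycolumns = ['RowValue_df1', 'RowValue2_df1', 'RowValue_df2', 'RowValue2_df2']
custom_val = 50

# Extract df2's columns once as NumPy arrays.
col1_df2 = df2["Column1"].to_numpy()
col2_df2 = df2["Column2"].to_numpy()

rows = []
for i in range(len(df1)):
    v1 = df1["Column1"].iat[i]
    v2 = df1["Column2"].iat[i]
    # One vectorized expression replaces the whole inner m-loop.
    calc_val = (v1 / col1_df2) * (col2_df2 / v1) * 100
    # np.where yields the indices of df2 rows satisfying the condition.
    for m in np.where(calc_val <= custom_val)[0]:
        rows.append((v1, v2, col1_df2[m], col2_df2[m]))

combination = pd.DataFrame(rows, columns=mycolumns)

If needed, the outer i-loop could also be removed with NumPy broadcasting, but that materializes a 40,000 × 60,000 intermediate array, so the per-row vectorization above is usually a better memory trade-off.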