Search code examples
pythonpandasdataframefor-loopvectorization

Pandas for Loop Optimization(Vectorization) when looking at previous row value


I'm looking to optimize the time taken for a function with a for loop. The code below is ok for smaller dataframes, but for larger dataframes, it takes too long. The function effectively creates a new column based on calculations using other column values and parameters. The calculation also considers the value of a previous row value for one of the columns. I read that the most efficient way is to use Pandas vectorization, but i'm struggling to understand how to implement this when my for loop is considering the previous row value of 1 column to populate a new column on the current row. I'm a complete novice, but have looked around and cant find anything that suits this specific problem, though I'm searching from a position of relative ignorance, so may have missed something.

The function is below and I've created a test dataframe and random parameters too. it would be great if someone could point me in the right direction to get the processing time down. Thanks in advance.

def MODE_Gain (Data, rated, MODELim1, MODEin, Normalin,NormalLim600,NormalLim1):
    print('Calculating Gains')
    df = Data
    df.fillna(0, inplace=True)
    df['MODE'] = ""
    df['Nominal'] = ""
    df.iloc[0, df.columns.get_loc('MODE')] = 0
    for i in range(1, (len(df.index))):
        print('Computing Status{i}/{r}'.format(i=i, r=len(df.index)))
        if ((df['MODE'].loc[i-1] == 1) & (df['A'].loc[i] > Normalin)) :
            df['MODE'].loc[i] = 1
        elif (((df['MODE'].loc[i-1] == 0) & (df['A'].loc[i] > NormalLim600))|((df['B'].loc[i] > NormalLim1) & (df['B'].loc[i] < MODELim1 ))):
            df['MODE'].loc[i] = 1
        else:
            df['MODE'].loc[i] = 0
    df[''] = (df['C']/6)
    for i in range(len(df.index)):
       print('Computing MODE Gains {i}/{r}'.format(i=i, r=len(df.index)))
       if ((df['A'].loc[i] > MODEin) & (df['A'].loc[i] < NormalLim600)&(df['B'].loc[i] < NormalLim1)) :
            df['Nominal'].loc[i] = rated/6
       else:
            df['Nominal'].loc[i] = 0
    df["Upgrade"] = df[""] - df["Nominal"]
    
    return df


A = np.random.randint(0,28,size=(8000))
B = np.random.randint(0,45,size=(8000))
C = np.random.randint(0,2300,size=(8000))

df = pd.DataFrame()

df['A'] = pd.Series(A)
df['B'] = pd.Series(B)
df['C'] = pd.Series(C)

MODELim600 = 32
MODELim30 = 28
MODELim1 = 39
MODEin = 23
Normalin = 20
NormalLim600 = 25
NormalLim1 = 32
rated = 2150

finaldf = MODE_Gain(df, rated, MODELim1, MODEin, Normalin,NormalLim600,NormalLim1)

Solution

  • Your second loop doesn't evaluate the prior row, so you should be able to use this instead

    df['Nominal'] = 0
    df.loc[(df['A'] > MODEin) & (df['A'] < NormalLim600) & (df['B'] < NormalLim1), 'Nominal'] = rated/6
    

    For your first loop, the elif statements looks to evaluate this

    ((df['B'].loc[i] > NormalLim1) & (df['B'].loc[i] < MODELim1 )) and sets it to 1 regardless of the other condition, so you can remove that and vectorize that operation. didn't try, but this should do it

    df.loc[(df['B'].loc[i] > NormalLim1) & (df['B'].loc[i] < MODELim1 ), 'MODE'] = 1
    

    then you may be able to collapse the other conditions into one statement use |

    Not sure how much all that will save you, but you should cut the time in half getting rid of the 2nd loop.