Search code examples
pandasnumpyoptimizationvectorizationnumba

How to efficiently and correctly implement numba jit decorator or apply vectorization instead of a for loop to speed up the program execution?


Atttempted to implement jit decorator to increase the speed of execution of my code. Not getting proper results. It is throughing all sorts of errors.. Key error, type errors, etc.. The actual code without numba is working without any issues.

# The Code without numba is:
df = pd.DataFrame()
df['Serial'] = [865,866,867,868,869,870,871,872,873,874,875,876,877,878,879,880]
df['Value'] = [586,586.45,585.95,585.85,585.45,585.5,586,585.7,585.7,585.5,585.5,585.45,585.3,584,584,585]
df['Ref'] = [586.35,586.1,586.01,586.44,586.04,585.91,585.47,585.99,585.35,585.27,585.32,584.86,585.36,584.18,583.53,585]
df['Base'] = [0,-1,1,1,1,1,-1,0,1,1,1,0,1,1,0,-1]

df['A'] = 0.0
df['B'] = 0.0
df['Counter'] = 0
df['Counter'][0] = df['Serial'][0]

for i in range(0,len(df)-1):
    # Filling Column 'A'
    if (df.iloc[1+i,2] > df.iloc[1+i,1]) & (df.iloc[i,5] > df.iloc[1+i,1]) & (df.iloc[1+i,3] >0):
        df.iloc[1+i,4] = round((df.iloc[1+i,1]*1.02),2)
    elif (df.iloc[1+i,2] < df.iloc[1+i,1]) & (df.iloc[i,5] < df.iloc[1+i,1]) & (df.iloc[1+i,3] <0):
        df.iloc[1+i,4] = round((df.iloc[1+i,1]*0.98),2)
    else:
        df.iloc[1+i,4] = df.iloc[i,4]
    # Filling Column 'B'
    df.iloc[1+i,5] = round(((df.iloc[1+i,1] + df.iloc[1+i,2])/2),2) 
    # Filling Column 'Counter'
    if (df.iloc[1+i,5] > df.iloc[1+i,1]):
        df.iloc[1+i,6] = df.iloc[1+i,0]
    else:
        df.iloc[1+i,6] = df.iloc[i,6]
df

The below code is giving me the error. where i tried to implement numba jit decorator to speed up the original python code.

#The code with numba jit which is throwing error is:
df = pd.DataFrame()
df['Serial']=[865,866,867,868,869,870,871,872,873,874,875,876,877,878,879,880]
df['Value']=[586,586.45,585.95,585.85,585.45,585.5,586,585.7,585.7,585.5,585.5,585.45,585.3,584,584,585]
df['Ref']=[586.35,586.1,586.01,586.44,586.04,585.91,585.47,585.99,585.35,585.27,585.32,584.86,585.36,584.18,583.53,585]
df['Base'] = [0,-1,1,1,1,1,-1,0,1,1,1,0,1,1,0,-1]
from numba import jit
@jit(nopython=True)
def Calcs(Serial,Value,Ref,Base):
    n = Base.size
    A = np.empty(n, dtype='f8')
    B = np.empty(n, dtype='f8')
    Counter = np.empty(n, dtype='f8')
    A[0] = 0.0
    B[0] = 0.0
    Counter[0] = Serial[0]
    for i in range(0,n-1):
        # Filling Column 'A'
        if (Ref[i+1] > Value[i+1]) & (B[i] > Value[i+1]) & (Base[i+1] > 0):
            A[i+1] = round((Value[i+1]*1.02),2)
        elif (Ref[i+1] < Value[i+1]) & (B[i] < Value[i+1]) & (Base[i+1] < 0):  
            A[i+1] = round((Value[i+1]*0.98),2)
        else:
            A[i+1] = A[i]
        # Filling Column 'B'
        B[i+1] = round(((Value[i+1] + Ref[i+1])/2),2)
        # Filling Column 'Counter'
        if (B[i+1] > Value[i+1]):
            Counter[i+1] = Serial[i+1]
        else:
            Counter[i+1] = Counter[i]   
    List = [A,B,Counter]        
    return List

Serial = df['Serial'].values.astype(np.float64)
Value = df['Value'].values.astype(np.float64)
Ref = df['Ref'].values.astype(np.float64)
Base = df['Base'].values.astype(np.float64)

VCal = Calcs(Serial,Value,Ref,Base)

df['A'].values[:] = VCal[0].astype(object)
df['B'].values[:] = VCal[1].astype(object)
df['Counter'].values[:] = VCal[2].astype(object)
df

I tried to modify the code as per the guidance provided by @Jérôme Richard for the question How to Eliminate for loop in Pandas Dataframe in filling each row values of a column based on multiple if,elif statements.

But getting the errors and am Unable to correct the code. Looking for some help from the community in correcting & improving the above code or to find an even better code to enhance the speed of execution. The expected outcome of the code is shown in the below pic. enter image description here


Solution

  • You can only use df['A'].values[:] if the column A exists in the dataframe. Otherwise you need to create a new one, possibly with df['A'] = ....

    Moreover, the trick with astype(object) applies for string but not for numbers. Indeed, string-based dataframe columns do apparently not use Numpy string-based array but Numpy object-based arrays containing CPython strings. For numbers, Pandas properly uses number-based arrays. Converting numbers back to object is inefficient. The same applies for astype(np.float64): it is not needed if the time is already fine. This is the case here. If you are unsure about the input type, you can let them though as they are not very expensive.

    The Numba function itself is fine (at least with a recent version of Numba). Note that you can specify the signature to compile the function eagerly. This feature also help you to catch typing errors sooner and make them a bit more clear. The downside is that it makes the function less generic as only specific types are supported (though you can specify multiple signatures).

    from numba import njit
    
    @njit('List(float64[:])(float64[:], float64[:], float64[:], float64[:])')
    def Calcs(Serial,Value,Ref,Base):
        [...]
    
    Serial = df['Serial'].values
    Value = df['Value'].values
    Ref = df['Ref'].values
    Base = df['Base'].values
    
    VCal = Calcs(Serial, Value, Ref, Base)
    
    df['A'] = VCal[0]
    df['B'] = VCal[1]
    df['Counter'] = VCal[2]
    

    Note that you can use the Numba decorator flag fastmath=True to speed up the computation if you are sure that the input array never contains spacial values like NaN or Inf or -0, and you do not rely on FP-math associativity.