Search code examples
pythonpandasdata-analysis

How to create a new column in dataframe, which will be a function of another column and conditionals faster than a for loop?


I have a relatively large data frame (8737 rows and 16 columns of all variable types, strings, integers, booleans etc.) and I want to create a new column based on an equation and some conditionals. Basically, I want to iterate over one particular column, take its values and after an operation calculate a value. Then I need to check if this value satisfies some conditions (>= or < to a set value). If it satisfies the conditionals then I need to keep the output of the calculation as an element of the new column, else assign a fixed value to that element.

I am doing that by looping over the entire dataset with a for loop, which takes a huge amount of time. I am quite new to python and couldn't quite find any similar problem solution online, other than alternating existing columns without a for loop.

Lets say for the sake of simplicity I have this data frame called df:

df=pd.DataFrame()
df['A']=[5,3,7,4,3,0,1,7,8,10,9,4,3,2,0]
df['S']=np.nan
df['S'][0]=5

Out[4]: 
     A    S
0    5  5.0
1    3  NaN
2    7  NaN
3    4  NaN
4    3  NaN
5    0  NaN
6    1  NaN
7    7  NaN
8    8  NaN
9   10  NaN
10   9  NaN
11   4  NaN
12   3  NaN
13   2  NaN
14   0  NaN

where S is the new column I need to calculate, starting from a set value. Next value of S I need to be the previous value of S plus some calculation. Basically, for every step of the calculation I want S(i)=S(i-1)+A(i)x0.5-2* and right after this to check if S(i)=<5 then S(i)=5, elseif S(i)>=10 then S(i)=10. This is one step of the itteration. My code is this with the desired outcome:

for i in range (1,df.shape[0]):
    df.S[i]=df.S[i-1]+df.A[i]*0.5-2
    if df.S[i]<=5:
        df.S[i]=5
    elif df.S[i]>10:
        df.S[i]=10
df

Out[6]: 
     A     S
0    5   5.0
1    3   5.0
2    7   6.5
3    4   6.5
4    3   6.0
5    0   5.0
6    1   5.0
7    7   6.5
8    8   8.5
9   10  10.0
10   9  10.0
11   4  10.0
12   3   9.5
13   2   8.5
14   0   6.5

This code for 8737 rows takes around 20 mins to complete. Is there any way to do this without a for loop or with a different, faster itterative method? If you need any clarifications, please ask me. Thank you in advance.


Solution

  • You can try numba for optimalize it:

    from numba import jit
    @jit(nopython=True)
    def get_vals(a, first):
        out = np.zeros(a.shape[0], dtype=float)
        out[0] = first
        for i in range(1, a.shape[0]):
            out[i] = out[i-1] + a[i]*0.5-2
            out[i] = 5 if out[i]<=5 else out[i]
            out[i] = 10 if out[i] > 10 else out[i]
        return out
    
    arr = df['A'].values
    df['new'] = get_vals(arr, df.loc[0, 'S'])