I have a relatively large data frame (8737 rows and 16 columns of all variable types, strings, integers, booleans etc.) and I want to create a new column based on an equation and some conditionals. Basically, I want to iterate over one particular column, take its values and after an operation calculate a value. Then I need to check if this value satisfies some conditions (>= or < to a set value). If it satisfies the conditionals then I need to keep the output of the calculation as an element of the new column, else assign a fixed value to that element.
I am doing that by looping over the entire dataset with a for loop, which takes a huge amount of time. I am quite new to python and couldn't quite find any similar problem solution online, other than alternating existing columns without a for loop.
Lets say for the sake of simplicity I have this data frame called df:
df=pd.DataFrame()
df['A']=[5,3,7,4,3,0,1,7,8,10,9,4,3,2,0]
df['S']=np.nan
df['S'][0]=5
Out[4]:
A S
0 5 5.0
1 3 NaN
2 7 NaN
3 4 NaN
4 3 NaN
5 0 NaN
6 1 NaN
7 7 NaN
8 8 NaN
9 10 NaN
10 9 NaN
11 4 NaN
12 3 NaN
13 2 NaN
14 0 NaN
where S is the new column I need to calculate, starting from a set value. Next value of S I need to be the previous value of S plus some calculation. Basically, for every step of the calculation I want S(i)=S(i-1)+A(i)x0.5-2* and right after this to check if S(i)=<5 then S(i)=5, elseif S(i)>=10 then S(i)=10. This is one step of the itteration. My code is this with the desired outcome:
for i in range (1,df.shape[0]):
df.S[i]=df.S[i-1]+df.A[i]*0.5-2
if df.S[i]<=5:
df.S[i]=5
elif df.S[i]>10:
df.S[i]=10
df
Out[6]:
A S
0 5 5.0
1 3 5.0
2 7 6.5
3 4 6.5
4 3 6.0
5 0 5.0
6 1 5.0
7 7 6.5
8 8 8.5
9 10 10.0
10 9 10.0
11 4 10.0
12 3 9.5
13 2 8.5
14 0 6.5
This code for 8737 rows takes around 20 mins to complete. Is there any way to do this without a for loop or with a different, faster itterative method? If you need any clarifications, please ask me. Thank you in advance.
You can try numba
for optimalize it:
from numba import jit
@jit(nopython=True)
def get_vals(a, first):
out = np.zeros(a.shape[0], dtype=float)
out[0] = first
for i in range(1, a.shape[0]):
out[i] = out[i-1] + a[i]*0.5-2
out[i] = 5 if out[i]<=5 else out[i]
out[i] = 10 if out[i] > 10 else out[i]
return out
arr = df['A'].values
df['new'] = get_vals(arr, df.loc[0, 'S'])