I have a dataframe with 500k lines as follows:
status_code
------------
202
302
403
500
202
.
.
.
------------
I have create a new column as 'status_code_grp' and then checks each row if it starts with "2", I will assign the "status_code_grp" = "200". The repeat this for grp = "300", "400", "500".
I have wrote something like this:
df2 = pd.DataFrame(np.random.randint(200,599,size=(500000, 1)), columns=['status_code'])
for eachRow in range(len(df)):
if(df['status_code'][eachRow].startswith['2']):
df['status_code_grp'][eachRow] = "2xx"
elif(df['status_code'][eachRow].startswith['3']):
df['status_code_grp'][eachRow] = "3xx"
elif(df['status_code'][eachRow].startswith['4']):
df['status_code_grp'][eachRow] = "4xx"
elif(df['status_code'][eachRow].startswith['5']):
df['status_code_grp'][eachRow] = "5xx"
The for loop took a very long time to complete. Is there a way to be more efficient other then checking row by row with the code above?
Divide by integer division by 100
and multiple by 100
:
df2['status_code_grp'] = df2['status_code'] // 100 * 100
In numpy it is faster, convert Series to array by Series.to_numpy
:
df2 = pd.DataFrame(np.random.randint(200,599,size=(500000, 1)), columns=['status_code'])
In [381]: %timeit df2['status_code_grp1'] = df2['status_code'] // 100 * 100
12.5 ms ± 935 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [382]: %timeit df2['status_code_grp2'] = df2['status_code'].to_numpy() // 100 * 100
6.62 ms ± 42.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)