Search code examples
pandasperformancedataframeprocessing-efficiencystartswith

Pandas: A more efficient way to check column startswith and assign new column?


I have a dataframe with 500k lines as follows:

status_code
------------
202
302
403
500
202
.
.
.
------------

I have create a new column as 'status_code_grp' and then checks each row if it starts with "2", I will assign the "status_code_grp" = "200". The repeat this for grp = "300", "400", "500".

I have wrote something like this:

df2 = pd.DataFrame(np.random.randint(200,599,size=(500000, 1)), columns=['status_code'])
for eachRow in range(len(df)):
    if(df['status_code'][eachRow].startswith['2']):
        df['status_code_grp'][eachRow] = "2xx" 
    elif(df['status_code'][eachRow].startswith['3']):
        df['status_code_grp'][eachRow] = "3xx" 
    elif(df['status_code'][eachRow].startswith['4']):
        df['status_code_grp'][eachRow] = "4xx" 
    elif(df['status_code'][eachRow].startswith['5']):
        df['status_code_grp'][eachRow] = "5xx" 

The for loop took a very long time to complete. Is there a way to be more efficient other then checking row by row with the code above?


Solution

  • Divide by integer division by 100 and multiple by 100:

    df2['status_code_grp'] = df2['status_code'] // 100  * 100
    

    In numpy it is faster, convert Series to array by Series.to_numpy:

    df2 = pd.DataFrame(np.random.randint(200,599,size=(500000, 1)), columns=['status_code'])
    
    In [381]: %timeit df2['status_code_grp1'] = df2['status_code'] // 100  * 100
    12.5 ms ± 935 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    In [382]: %timeit df2['status_code_grp2'] = df2['status_code'].to_numpy() // 100  * 100
    6.62 ms ± 42.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)