Search code examples
pythonpandasloopsvectorization

How do I iterate through a Pandas dataframe with conditions? (confusion over iterrows/for loops/vectorization)


I have a dataset I need to iterate on a condition:

data = [[-10, 10, 'Hawaii', 'Honolulu'], [-22, 63], [32, -14]]
df = pd.DataFrame(data, columns = ['lat', 'long', 'state', 'capital'])


for x in range(len(df))
    if df['state'] and df['capital'] = np.nan:
        df['state'] = 'Investigate state'
        df['capital'] = 'Investigate capital'

My expected output is that if the state field and capital fields are both empty then fill in the empty fields respectively. The actual data I use and the function within the loop is more complex than this example but what I want to focus on is the iterative/looping portion with the condition.

My Googling found iterrows and I read tutorials that just say to go ahead and use a for loop. Stackoverflow answers denounced the two options above and advocated to use vectorization instead. My actual dataset will have around ~20,000 rows. What is the most efficient implementation and how do I implement it?


Solution

  • You can test each column separately and chain masks by & for bitwise AND:

    m = df['state'].isna() & df['capital'].isna()
    df.loc[m, ['capital', 'state']] = ['Investigate capital','Investigate state']
    

    Fastest is in sample data for 30k rows and 66% matching if also set columns separately:

    m = df['state'].isna() & df['capital'].isna()
    df['state']= np.where(m, 'Investigate state', df['state'])
    df['capital']= np.where(m, 'Investigate capital', df['capital'])
    

    Similar:

    m = df['state'].isna() & df['capital'].isna()
    df.loc[m, 'state']='Investigate state'
    df.loc[m, 'capital']='Investigate capital'
    

    #30k rows
    df = pd.concat([df] * 10000, ignore_index=True)
    
    
    %%timeit
        ...: m = df['state'].isna() & df['capital'].isna()
        ...: df['state']= np.where(m, 'Investigate state', df['state'])
        ...: df['capital']= np.where(m, 'Investigate capital', df['capital'])
        ...: 
    3.45 ms ± 39.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    %%timeit 
    m = df['state'].isna() & df['capital'].isna()
    df.loc[m,'state']='Investigate state'
    df.loc[m,'capital']='Investigate capital'
    
    
    3.58 ms ± 11 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    %%timeit
    m = df['state'].isna() & df['capital'].isna()
    df.loc[m,['capital', 'state']] = ['Investigate capital','Investigate state']
    
    4.5 ms ± 355 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    Another solutions:

    %%timeit 
    m=df[['state','capital']].isna().all(1)
    df.loc[m]=df.loc[m].fillna({'state':'Investigate state','capital':'Investigate capital'})
    
    6.68 ms ± 235 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    
    %%timeit 
    m=df[['state','capital']].isna().all(1)
    df.loc[m,'state']='Investigate state'
    df.loc[m,'capital']='Investigate capital'
    
    
    4.72 ms ± 284 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)