How do I iterate through a Pandas dataframe with conditions? (confusion over iterrows/for loops/vectorization)

I have a dataset I need to iterate on a condition:

data = [[-10, 10, 'Hawaii', 'Honolulu'], [-22, 63], [32, -14]]
df = pd.DataFrame(data, columns = ['lat', 'long', 'state', 'capital'])


for x in range(len(df))
    if df['state'] and df['capital'] = np.nan:
        df['state'] = 'Investigate state'
        df['capital'] = 'Investigate capital'

My expected output is that if the state field and capital fields are both empty then fill in the empty fields respectively. The actual data I use and the function within the loop is more complex than this example but what I want to focus on is the iterative/looping portion with the condition.

My Googling found iterrows and I read tutorials that just say to go ahead and use a for loop. Stackoverflow answers denounced the two options above and advocated to use vectorization instead. My actual dataset will have around ~20,000 rows. What is the most efficient implementation and how do I implement it?

Solution

You can test each column separately and chain masks by & for bitwise AND:

m = df['state'].isna() & df['capital'].isna()
df.loc[m, ['capital', 'state']] = ['Investigate capital','Investigate state']

Fastest is in sample data for 30k rows and 66% matching if also set columns separately:

m = df['state'].isna() & df['capital'].isna()
df['state']= np.where(m, 'Investigate state', df['state'])
df['capital']= np.where(m, 'Investigate capital', df['capital'])

Similar:

m = df['state'].isna() & df['capital'].isna()
df.loc[m, 'state']='Investigate state'
df.loc[m, 'capital']='Investigate capital'

#30k rows
df = pd.concat([df] * 10000, ignore_index=True)


%%timeit
    ...: m = df['state'].isna() & df['capital'].isna()
    ...: df['state']= np.where(m, 'Investigate state', df['state'])
    ...: df['capital']= np.where(m, 'Investigate capital', df['capital'])
    ...: 
3.45 ms ± 39.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit 
m = df['state'].isna() & df['capital'].isna()
df.loc[m,'state']='Investigate state'
df.loc[m,'capital']='Investigate capital'


3.58 ms ± 11 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
m = df['state'].isna() & df['capital'].isna()
df.loc[m,['capital', 'state']] = ['Investigate capital','Investigate state']

4.5 ms ± 355 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Another solutions:

%%timeit 
m=df[['state','capital']].isna().all(1)
df.loc[m]=df.loc[m].fillna({'state':'Investigate state','capital':'Investigate capital'})

6.68 ms ± 235 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


%%timeit 
m=df[['state','capital']].isna().all(1)
df.loc[m,'state']='Investigate state'
df.loc[m,'capital']='Investigate capital'


4.72 ms ± 284 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)