I have a dataset I need to iterate on a condition:
data = [[-10, 10, 'Hawaii', 'Honolulu'], [-22, 63], [32, -14]]
df = pd.DataFrame(data, columns = ['lat', 'long', 'state', 'capital'])
for x in range(len(df))
if df['state'] and df['capital'] = np.nan:
df['state'] = 'Investigate state'
df['capital'] = 'Investigate capital'
My expected output is that if the state field and capital fields are both empty then fill in the empty fields respectively. The actual data I use and the function within the loop is more complex than this example but what I want to focus on is the iterative/looping portion with the condition.
My Googling found iterrows and I read tutorials that just say to go ahead and use a for loop. Stackoverflow answers denounced the two options above and advocated to use vectorization instead. My actual dataset will have around ~20,000 rows. What is the most efficient implementation and how do I implement it?
You can test each column separately and chain masks by &
for bitwise AND
:
m = df['state'].isna() & df['capital'].isna()
df.loc[m, ['capital', 'state']] = ['Investigate capital','Investigate state']
Fastest is in sample data for 30k rows and 66% matching if also set columns separately:
m = df['state'].isna() & df['capital'].isna()
df['state']= np.where(m, 'Investigate state', df['state'])
df['capital']= np.where(m, 'Investigate capital', df['capital'])
Similar:
m = df['state'].isna() & df['capital'].isna()
df.loc[m, 'state']='Investigate state'
df.loc[m, 'capital']='Investigate capital'
#30k rows
df = pd.concat([df] * 10000, ignore_index=True)
%%timeit
...: m = df['state'].isna() & df['capital'].isna()
...: df['state']= np.where(m, 'Investigate state', df['state'])
...: df['capital']= np.where(m, 'Investigate capital', df['capital'])
...:
3.45 ms ± 39.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
m = df['state'].isna() & df['capital'].isna()
df.loc[m,'state']='Investigate state'
df.loc[m,'capital']='Investigate capital'
3.58 ms ± 11 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
m = df['state'].isna() & df['capital'].isna()
df.loc[m,['capital', 'state']] = ['Investigate capital','Investigate state']
4.5 ms ± 355 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Another solutions:
%%timeit
m=df[['state','capital']].isna().all(1)
df.loc[m]=df.loc[m].fillna({'state':'Investigate state','capital':'Investigate capital'})
6.68 ms ± 235 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
m=df[['state','capital']].isna().all(1)
df.loc[m,'state']='Investigate state'
df.loc[m,'capital']='Investigate capital'
4.72 ms ± 284 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)