I'm trying to replace NaNs in different columns and I'd like to know which method is better (faster) for this task: replace or fillna.
Here's some sample code for the fillna option:
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key_2': ['K0.1', 'K1.1', 'K2.1'],
                      'B': ['B0', 'B1', 'B2']}, index=[0, 2, 3])
result = df.join([other])
After this line the joined dataframe looks like this:
  key   A key_2    B
0  K0  A0  K0.1   B0
1  K1  A1   NaN  NaN
2  K2  A2  K1.1   B1
3  K3  A3  K2.1   B2
4  K4  A4   NaN  NaN
5  K5  A5   NaN  NaN
and after doing the fillna with
result[['key','key_2']] = result[['key','key_2']].fillna('K0.0')
result[['A','B']] = result[['A','B']].fillna('B0.0')
it looks like this:
  key   A key_2     B
0  K0  A0  K0.1    B0
1  K1  A1  K0.0  B0.0
2  K2  A2  K1.1    B1
3  K3  A3  K2.1    B2
4  K4  A4  K0.0  B0.0
5  K5  A5  K0.0  B0.0
Using replace instead:
result[['key','key_2']] = result[['key','key_2']].replace(np.nan,'K0.0')
result[['A','B']] = result[['A','B']].replace(np.nan,'B0.0')
The resulting dataframe is:
  key   A key_2     B
0  K0  A0  K0.1    B0
1  K1  A1  K0.0  B0.0
2  K2  A2  K1.1    B1
3  K3  A3  K2.1    B2
4  K4  A4  K0.0  B0.0
5  K5  A5  K0.0  B0.0
As you can see, they both achieve the same result, at least as far as I've been able to test.
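One way to check this programmatically rather than by eye is to build both versions from the same joined frame and compare them with DataFrame.equals. This is only a sketch for the sample data above; the names filled, replaced and same are mine, and I pass the frame directly to join instead of wrapping it in a list.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key_2': ['K0.1', 'K1.1', 'K2.1'],
                      'B': ['B0', 'B1', 'B2']}, index=[0, 2, 3])
result = df.join(other)

# Version 1: fill missing values with fillna
filled = result.copy()
filled[['key', 'key_2']] = filled[['key', 'key_2']].fillna('K0.0')
filled[['A', 'B']] = filled[['A', 'B']].fillna('B0.0')

# Version 2: substitute np.nan with replace
replaced = result.copy()
replaced[['key', 'key_2']] = replaced[['key', 'key_2']].replace(np.nan, 'K0.0')
replaced[['A', 'B']] = replaced[['A', 'B']].replace(np.nan, 'B0.0')

# equals compares both the values and the dtypes of the two frames
same = filled.equals(replaced)
print(same)
```

This only confirms agreement on this particular data; it says nothing about other dtypes.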
I have 2 questions:
Empty values in pandas are often represented with np.nan, although NaT is also used for datetimes; pandas treats the two as compatible. Also from the documentation linked above:
The choice of using NaN internally to denote missing data was largely for simplicity and performance reasons. Starting from pandas 1.0, some optional data types start experimenting with a native NA scalar using a mask-based approach. See here for more.
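As a small illustration of that compatibility, isna and fillna treat np.nan in a float column and NaT in a datetime column the same way. The series names and the fill values below are arbitrary choices of mine for the example.

```python
import numpy as np
import pandas as pd

# A float column uses np.nan as its missing marker
values = pd.Series([1.0, np.nan, 3.0])

# A datetime column uses NaT; passing None to to_datetime produces NaT
dates = pd.Series(pd.to_datetime(['2021-01-01', None, '2021-01-03']))

# isna detects both markers
values_missing = values.isna().tolist()
dates_missing = dates.isna().tolist()
print(values_missing, dates_missing)

# fillna works on both, given a fill value of a matching type
filled_values = values.fillna(0.0)
filled_dates = dates.fillna(pd.Timestamp('2000-01-01'))
print(filled_values.tolist())
print(filled_dates.tolist())
```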
For efficiency, fillna and replace seem fairly similar.
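If you want to measure this on your own machine, a rough sketch with timeit could look like the following. The frame size, the share of missing values and the number of repetitions are assumptions I picked for illustration, and the timings will vary with pandas version and hardware, so I am not quoting numbers here.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 100_000   # arbitrary size for the benchmark
n_calls = 10       # arbitrary number of repetitions

# Object array of strings with roughly half of the entries set to NaN
values = rng.choice(np.array(['K0.1', 'K1.1', 'K2.1'], dtype=object), size=n_rows)
values[rng.random(n_rows) < 0.5] = np.nan
frame = pd.DataFrame({'key_2': values, 'B': values.copy()})

# Time each method on the same frame; neither call modifies frame in place
t_fillna = timeit.timeit(lambda: frame.fillna('K0.0'), number=n_calls)
t_replace = timeit.timeit(lambda: frame.replace(np.nan, 'K0.0'), number=n_calls)

print(f"fillna:  {t_fillna / n_calls:.6f} s per call")
print(f"replace: {t_replace / n_calls:.6f} s per call")
```

A single run like this is noisy, so it is worth repeating it a few times before drawing conclusions.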
However, given the documentation's note that "some optional data types start experimenting with a native NA scalar using a mask-based approach", it is safer to just use fillna and let pandas handle the missing values. Also, from a readability standpoint, fillna is shorter and clearer than replace(np.nan, ...).
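To illustrate the point about the mask-based types, here is a minimal sketch of fillna on two of the nullable dtypes, where the missing marker is pd.NA rather than np.nan. The series and fill values are made up for the example.

```python
import pandas as pd

# Nullable integer and string dtypes store missing values as pd.NA
ints = pd.Series([1, pd.NA, 3], dtype='Int64')
texts = pd.Series(['B0', pd.NA, 'B2'], dtype='string')

# fillna does not need to be told which missing marker the dtype uses
ints_filled = ints.fillna(0)
texts_filled = texts.fillna('B0.0')

print(ints_filled.tolist())   # [1, 0, 3]
print(texts_filled.tolist())  # ['B0', 'B0.0', 'B2']
print(ints_filled.dtype)      # Int64
```

With fillna the code never has to name the missing marker explicitly, which is the main reason I consider it the safer default.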