Tags: python, pandas, dataframe, performance

Pandas replace(np.nan, value) vs fillna(value): which is faster?


I'm trying to replace NaNs in several columns and I'd like to know which method is faster for this task: replace or fillna.

Here's some sample code; first, the setup shared by both options:

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key_2': ['K0.1', 'K1.1', 'K2.1'],
                      'B': ['B0', 'B1', 'B2']}, index=[0, 2, 3])

result = df.join([other])

After the join, the dataframe looks like this:

  key   A key_2    B
0  K0  A0  K0.1   B0
1  K1  A1   NaN  NaN
2  K2  A2  K1.1   B1
3  K3  A3  K2.1   B2
4  K4  A4   NaN  NaN
5  K5  A5   NaN  NaN

and after doing the fillna with

result[['key','key_2']] = result[['key','key_2']].fillna('K0.0')
result[['A','B']] = result[['A','B']].fillna('B0.0')

it looks like this:

  key   A key_2     B
0  K0  A0  K0.1    B0
1  K1  A1  K0.0  B0.0
2  K2  A2  K1.1    B1
3  K3  A3  K2.1    B2
4  K4  A4  K0.0  B0.0
5  K5  A5  K0.0  B0.0

Using replace instead:

result[['key','key_2']] = result[['key','key_2']].replace(np.nan,'K0.0')
result[['A','B']] = result[['A','B']].replace(np.nan,'B0.0')

The resulting dataframe is:

  key   A key_2     B
0  K0  A0  K0.1    B0
1  K1  A1  K0.0  B0.0
2  K2  A2  K1.1    B1
3  K3  A3  K2.1    B2
4  K4  A4  K0.0  B0.0
5  K5  A5  K0.0  B0.0

As you can see, they both achieve the same result, at least as far as I've been able to test.

I have 2 questions:

  1. What kind of NaN does join create? (Seeing that np.nan is matched, I think it's that one, but I want to be sure I catch every NaN the join method creates.)
  2. Which one is faster, fillna or replace?

Solution

Missing values in pandas are generally represented with np.nan; datetime-like data uses NaT instead, but the two are treated as compatible missing-value markers and functions like isna and fillna handle both. Since the joined columns here hold strings (object dtype), the gaps that join introduces are plain np.nan. From the pandas documentation on missing data:

    The choice of using NaN internally to denote missing data was largely for simplicity and performance reasons. Starting from pandas 1.0, some optional data types start experimenting with a native NA scalar using a mask-based approach. See here for more.
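
A quick way to confirm this is with pd.isna, which treats np.nan, NaT and None alike as missing; the snippet below simply rebuilds the joined frame from the question:

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key_2': ['K0.1', 'K1.1', 'K2.1'],
                      'B': ['B0', 'B1', 'B2']}, index=[0, 2, 3])
result = df.join([other])

missing = result.loc[1, 'B']
print(type(missing))        # <class 'float'>: join fills the object columns with float NaN
print(np.isnan(missing))    # True, so replace(np.nan, ...) does match these gaps
print(pd.isna(missing))     # True
print(pd.isna(pd.NaT))      # True as well: pd.isna covers NaN, NaT and None alike
print(result.isna().sum())  # per-column count of everything pandas treats as missing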

In terms of speed the two come out fairly similar:

  • fillna
      • time for 10000 runs: 24.815383911132812
      • average time per run: 0.0024815383911132812
  • replace
      • time for 10000 runs: 20.818645477294922
      • average time per run: 0.002081864547729492
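
For reference, a comparison along these lines can be reproduced with timeit; the frame size, NaN fraction, fill value and run count below are arbitrary choices, not the ones behind the figures above:

import timeit

import numpy as np
import pandas as pd

# Build a sample frame with scattered NaNs (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
data = rng.random((10_000, 4))
data[rng.random(data.shape) < 0.3] = np.nan
base = pd.DataFrame(data, columns=list('ABCD'))

n_runs = 1_000
t_fillna = timeit.timeit(lambda: base.fillna(0.0), number=n_runs)
t_replace = timeit.timeit(lambda: base.replace(np.nan, 0.0), number=n_runs)

print(f"fillna : total {t_fillna:.3f} s, per run {t_fillna / n_runs:.6f} s")
print(f"replace: total {t_replace:.3f} s, per run {t_replace / n_runs:.6f} s")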

However, given that "some optional data types start experimenting with a native NA scalar using a mask-based approach", it is safer to just use fillna and let pandas decide how to represent and fill the missing values. From a readability standpoint, fillna is also shorter and clearer than replace(np.nan, ...).
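
As a side note on those nullable dtypes: with an extension dtype such as 'string', the missing entry is pd.NA rather than np.nan, and fillna still handles it directly (a minimal sketch):

import pandas as pd

# Nullable extension dtype: the missing entry is pd.NA, not np.nan.
s = pd.Series(['K0.1', pd.NA, 'K2.1'], dtype='string')
print(s.isna().tolist())          # [False, True, False]
print(s.fillna('K0.0').tolist())  # ['K0.1', 'K0.0', 'K2.1']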