python pandas numpy null na

Difference between pandas <NA> and NaN for numeric columns


I have a DataFrame column of dtype float64 that is full of NaN values. If I cast it to float64 again, the NaN values are replaced by <NA> values, which are not the same thing.

I know that <NA> values are pd.NA, while NaN values are np.nan, so they are different things. So why did casting an already-float64 column to float64 change NaN to <NA>?

Here's an example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0]})
print(df.dtypes)
#output is: float64

df['a'] = np.nan
print(df.dtypes)
# output is float64

print(df)
    a
0   NaN
1   NaN

# Now, let's cast that float64 column using pd.Float64Dtype()
df['a'] = df['a'].astype(pd.Float64Dtype())
print(df.dtypes)
# output is Float64; notice the uppercase F this time, previously it was lowercase

print(df)

    a
0   <NA>
1   <NA>

It seems float64 and Float64 are two different things: NaN (np.nan) is the null value for float64, while <NA> (pd.NA) is the null value for Float64.

Is this correct? And if so, what's happening under the hood?


Solution

  • Yes, you are correct. float64 and Float64 are two different data types in pandas. float64 is the native NumPy dtype, which represents missing values with the floating-point value NaN. Float64 (pd.Float64Dtype) is a pandas nullable extension dtype, which represents missing values with pd.NA. Under the hood, a Float64 column is a masked array: the values are stored in a NumPy float64 array, and a separate boolean mask array records which positions are missing. Converting NaN-filled float64 data to Float64 marks those positions as missing, which is why they print as <NA>. Float64 can carry some overhead compared to float64 because of the extra mask, but it allows more consistent handling of missing values across data types, since the nullable integer, boolean, and string dtypes use pd.NA as well.

    Check this out: Numpy float64 vs Python float
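
To make the difference concrete, here is a minimal sketch using only public pandas calls. The variable names (s_numpy, s_nullable, back) are illustrative and not from the question, and the behavior shown assumes a pandas version with default settings, where NaN is treated as missing when converting to Float64.

```python
import numpy as np
import pandas as pd

# NumPy-backed float64 column: the missing value is the float NaN.
s_numpy = pd.Series([1.0, np.nan], dtype="float64")

# Nullable extension dtype: the same missing value becomes pd.NA.
s_nullable = s_numpy.astype("Float64")

print(s_numpy.dtype, s_nullable.dtype)   # float64 Float64
print(s_nullable[1] is pd.NA)            # True

# The backing container is a pandas masked array, not a plain ndarray.
print(type(s_nullable.array).__name__)   # FloatingArray

# Both kinds of missing value are detected by isna().
print(s_numpy.isna().tolist(), s_nullable.isna().tolist())

# Convert back to a plain NumPy float64 array, mapping <NA> to NaN.
back = s_nullable.to_numpy(dtype="float64", na_value=np.nan)
print(back.dtype)                        # float64
```

If the NumPy behavior is what you want, casting with astype("float64") or calling to_numpy as above gets you back to NaN-based missing values.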