I have a DataFrame column of dtype float64 that is full of NaN values. If I cast it to float64 again, the values are replaced with <NA> values, which are not the same thing.
I know that the <NA> values are pd.NA, while the NaN values are np.nan, so they are different things. So why does casting an already-float64 column to float64 change NaN to <NA>?
Here's an example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0]})
print(df.dtypes)
# output is: float64
df['a'] = np.nan
print(df.dtypes)
# output is: float64
print(df)
     a
0  NaN
1  NaN
# Now, let's cast that float64 column to Float64
df['a'] = df['a'].astype(pd.Float64Dtype())
print(df.dtypes)
# output is: Float64 (note the uppercase F this time; previously it was lowercase)
print(df)
      a
0  <NA>
1  <NA>
It seems float64 and Float64 are two different things: NaN (np.nan) is the null value for float64, while <NA> (pd.NA) is the null value for Float64.
Is this correct? And if so, what's going on under the hood?
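Yes, you are correct. float64 and Float64 are two different data types in pandas. float64 is the native NumPy dtype, which uses NaN to represent missing values. Float64 (pd.Float64Dtype) is a pandas nullable extension dtype, which represents missing values with pd.NA. So in your example you did not cast float64 to float64: astype(pd.Float64Dtype()) converted the column to the extension dtype, and during that conversion pandas treated each NaN as a missing value and stored it as <NA>.
Under the hood, a Float64 column is a masked array: the values are kept in a NumPy array of dtype float64, alongside a boolean mask that marks which positions are missing. A float64 column is just the plain NumPy float64 array. The extra mask means Float64 can carry some memory and performance overhead compared to float64, but it also allows more consistent handling of missing values across different data types, since the nullable integer, boolean, and string dtypes use pd.NA as well.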
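To make the difference concrete, here is a minimal sketch using only public pandas API. The variable names are mine, and it assumes the default behaviour of the pandas versions I know of, where NaN is treated as missing when converting to a nullable dtype:

```python
import numpy as np
import pandas as pd

# Plain NumPy-backed column: the missing value is the float NaN.
s_numpy = pd.Series([1.0, np.nan], dtype="float64")

# Nullable extension column: "Float64" is the string alias for pd.Float64Dtype().
s_nullable = s_numpy.astype("Float64")

print(s_numpy.dtype)     # float64
print(s_nullable.dtype)  # Float64

# The missing element is a different object in each case.
print(isinstance(s_numpy[1], float) and np.isnan(s_numpy[1]))  # True
print(s_nullable[1] is pd.NA)                                   # True

# isna() reports the same positions as missing for both.
print(s_numpy.isna().tolist())     # [False, True]
print(s_nullable.isna().tolist())  # [False, True]

# Going back to a plain NumPy array: choose what pd.NA should become.
back = s_nullable.to_numpy(dtype="float64", na_value=np.nan)
print(back.dtype)  # float64
```

If you want to stay on the NumPy dtype, use astype("float64") or astype(np.float64) with a lowercase f; that leaves NaN as NaN.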
Check this out: Numpy float64 vs Python float