Tags: python, pandas, numpy, boolean, missing-data

Using pandas nullable integer dtype in np.where condition


I have the DataFrame below, which has some missing values.

import numpy as np
import pandas as pd
ydf = pd.DataFrame(data=[['A', 1, None], ['B', 2, 5]],
                   columns=['X', 'Y', 'Z'])

Since ydf['Z'] is supposed to be an integer column, I changed its data type to pandas' new experimental nullable integer type, as below.

ydf['Z'] = ydf['Z'].astype(pd.Int32Dtype())
ydf

    X   Y   Z
0   A   1   <NA>
1   B   2   5

Now I am trying to use np.where to replace the non-null values in the column ydf['Z'] with a fixed integer value (say 1), using the code below.

np.where(pd.isna(ydf['Z']), pd.NA, np.where(ydf['Z'] > 0, 1, 0))

But I get the following error, and I don't understand why, since I am already checking for the rows with null values in the first condition.

TypeError: boolean value of NA is ambiguous

Solution

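  • np.where expects an array of booleans. With a NumPy float64 column, using > on the Series returns False for NaN. With the nullable Int32 dtype (note the capital I), > doesn't coerce missing values to False but leaves them as <NA>, and np.where cannot interpret <NA> as True or False, thus the error. The pd.isna check in the outer call doesn't help, because the inner np.where is evaluated in full before the outer one runs.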

    One solution would be to use ydf['Z'].gt(0).fillna(False) instead of ydf['Z'] > 0. (The two comparisons are equivalent, except that the .fillna(False) version changes NA to False):

    np.where(pd.isna(ydf['Z']), pd.NA, np.where(ydf['Z'].gt(0).fillna(False), 1, 0))
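
For reference, here is a minimal end-to-end sketch of the fix above, rebuilding the question's DataFrame. The names compared, filled, result and native are only illustrative. The final astype('Int32') line is not part of the original answer; it is an alternative I would assume works on any pandas version that has the nullable dtypes, and it keeps the result as a nullable integer Series instead of an object array.

```python
import numpy as np
import pandas as pd

# Rebuild the DataFrame from the question.
ydf = pd.DataFrame(data=[['A', 1, None], ['B', 2, 5]],
                   columns=['X', 'Y', 'Z'])
ydf['Z'] = ydf['Z'].astype(pd.Int32Dtype())

# The comparison on a nullable integer column keeps the missing value as <NA>,
# which is what np.where trips over.
compared = ydf['Z'] > 0

# Fix from the answer: turn <NA> into False before handing the mask to np.where.
filled = ydf['Z'].gt(0).fillna(False)
result = np.where(pd.isna(ydf['Z']), pd.NA, np.where(filled, 1, 0))
print(result)

# Assumed alternative (not from the answer): stay in pandas and cast the
# nullable boolean result to a nullable integer, so <NA> is preserved.
native = ydf['Z'].gt(0).astype('Int32')
print(native)
```

Note that result is a NumPy object array because it mixes pd.NA with integers, whereas native remains an Int32 Series that can be assigned straight back to the column.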