Search code examples
pandaslistnull

Check if specific value in a series is NaN


I have a Dataframe where I want to derive a Boolean column based on some other columns, including a check on whether a particular column is NaN:

def secret_sauce(my_row):
    if (pd.isna(my_row["col1"]) and
        my_row["col2"] != "BAD_VAL"):
        return False
    return True

my_df = my_df.assign(
    my_boolean_col=lambda df: df.apply(
        secret_sauce, axis=1
    )
)

col1 can be either a list or NaN. I keep running into problems with various ways to check if it is NaN, because for example pd.isna() does not like being called on a list.

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

However, any() or all() does not seem to be correct for calling on NaN. So I seem to be stuck. What would be the best way to do this?

The following setup can reproduce the issue:

import pandas as pd
import numpy as np

d = {
    'mypk': ['ABC', 'DEF', 'GHI'],
    'col1': [["HELLO"], np.nan, ["GOODBYE", "ADIOS"]],
    'col2': ['GOOD_VAL', 'GOOD_VAL', 'GOOD_VAL']}
my_df = pd.DataFrame(data=d)
  mypk              col1      col2
0  ABC           [HELLO]  GOOD_VAL
1  DEF               NaN  GOOD_VAL
2  GHI  [GOODBYE, ADIOS]  GOOD_VAL

What I want is a new column with values True, False, True, respectively for the 3 rows. It processes the first 2 rows fine, but fails when it gets to the 3rd row with multiple elements in col1. It can process the list with a single element fine.

So what I'm wondering is, what's the best way to have a condition that is True if it's NaN, and False otherwise, in this context? The different ways I've found so far don't seem to properly handle the possible different input values.


Solution

  • For me, this is an XY problem because you don't need to use .apply() in the first place; instead you should use isna and != on the whole columns instead of single elements, and in that case, the problem disappears.

    my_df.assign(
        my_boolean_col=lambda df: ~(df['col1'].isna() & df['col2'].ne('BAD_VAL'))
    )
    
      mypk              col1      col2  my_boolean_col
    0  ABC           [HELLO]  GOOD_VAL            True
    1  DEF               NaN  GOOD_VAL           False
    2  GHI  [GOODBYE, ADIOS]  GOOD_VAL            True
    

    Check out cs95's answer on "How can I iterate over rows in a Pandas DataFrame?", which says, first prefer vectorized methods (as I'm using here), and then .apply() is lower down the list.