Search code examples
pandasmissing-data

Pandas: what is truth value condition when dealing with missing data


I have a function that creates a ratio. It is defined as

def create_ratio(data,num,den):
    if data[num].isnull():
        ratio = -9997
    if data[den].isnull():
        ratio = -9998
    if data[num].isnull() & data[den].isnull():
        ratio = -9999
    else:
        ratio = data[num]/data[den]
    return ratio

I have pandas dataframe (df_credit) which includes credit card balance (cc_bal) and limit (cc_limit) and I want to calculate credit card utilization which is balance over limit

df_credit['cc_util'] = create_ratio(df_credit,'cc_bal','cc_limit')

I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-66-d53809a7690d> in <module>
----> 1 data['ratio_cc_util'] = create_ratio(data,'open_credit_card_credit_limit_nomiss','open_credit_card_credit_limit_nomiss')
      2 data['ratio_cc_util'].hist()

<ipython-input-65-99bc55b184ed> in create_ratio(data, num, den)
      1 def create_ratio(data,num,den):
----> 2     if data[num].isnull():
      3         ratio = -9997
      4     if data[den].isnull():
      5         ratio = -9998

/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py in __nonzero__(self)
   1441     def __nonzero__(self):
   1442         raise ValueError(
-> 1443             f"The truth value of a {type(self).__name__} is ambiguous. "
   1444             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   1445         )

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

What is the solution to this error? Thanks.


Solution

    • you are mixing between scalars and series, where your function needs to return a series or array given its calling context
    • as simple way as any to implement this conditional logic is np.select()
    • have simulated data, including missing values to meet your use cases
    df = pd.DataFrame({
            "cc_bal": np.random.uniform(200, 1000, 200),
            "cc_limit": np.random.uniform(800, 1200, 200),})
    
    df.loc[np.unique(np.random.choice(range(len(df)), 30)), "cc_bal"] = None
    df.loc[np.unique(np.random.choice(range(len(df)), 30)), "cc_limit"] = None
    
    
    def create_ratio(df, num, den):
        return np.select(
            [
                df[num].isnull() & df[den].isnull(),
                df[num].isnull(),
                df[den].isnull(),
            ],
            [-9999, -9997, -9998],
            df[num] / df[den],
        )
    
    
    df["ratio"] = create_ratio(df, "cc_bal", "cc_limit")
    df
    

    sample output

    cc_bal cc_limit ratio
    0 372.633 981.996 0.379465
    1 845.541 1133.69 0.745831
    2 449.406 975.903 0.460503
    3 209.827 922.829 0.227374
    4 237.347 936.654 0.253398
    5 351.154 nan -9998
    6 nan 873.671 -9997
    7 803.396 861.791 0.93224
    8 591.136 807.176 0.732352
    9 675.397 847.059 0.797344