Tags: python, pandas, conditional-statements, series

How to make a new pandas column based on multiple conditionals, including 'isnull', 'or', and 'colB isin colA'-like statements?


First time asking. Is there a way to get a new df column that combines all three kinds of statements (or, isnull-like, isin-like) without iterating over a for loop, i.e. keeping the code within the spirit of Pandas? I've tried advice from several threads dealing with individual aspects of common conditional problems, but every variation I've tried either leads me to "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()." or produces incorrect results. Below are example data and code from several attempts.

My goal is to produce a dataframe (printed below) where the last (new) column concatenates all words from the company and unit columns, (1) without any NaNs (note: 'unit_desc' contains no null values irl, so NaNs in 'comp_unit' mean my function isn't working properly), and (2) without repeating the company name (because sometimes 'unit_desc' already [incorrectly] contains the company name, as with example row 2).

Desired dataframe

company      unit_desc                 comp_new     comp_unit
Generic      Some description          NaN          Some description
NaN          Unit with features        NaN          Unit with features
Some LLC     Some LLC Xtra cool space  Some LLC     Some LLC Xtra cool space
Another LLC  Unit with features        Another LLC  Another LLC Unit with features
Another LLC  Basic unit                Another LLC  Another LLC Basic unit
Some LLC     basic unit                Some LLC     Some LLC basic unit

Imports and initial example df

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'company': ['Generic', np.nan, 'Some LLC', 'Another LLC', 'Another LLC', 'Some LLC'], 
    'unit_desc': ['Some description', 'Unit with features', 'Some LLC Xtra cool space', 'Unit with features', 'Basic unit', 'basic unit'],
    })

ATTEMPT 0: Uses np.where
ATTEMPT 0 Results: ValueError as above

def my_func(df, unit, comp, bad_info_list):
    """Return new dataframe with new column combining company and unit descriptions

    Args:
        df (DataFrame): Pandas dataframe with product and brand info
        unit (str): name of unit description column
        comp (str): name of company name column
        bad_info_list (list): list of unwanted terms
    """

    # (WORKS) Make new company column filtering out unwanted terms
    df["comp_new"] = df[comp].apply(lambda x: x if x not in bad_info_list else np.nan)
    # (!!!START OF NOT WORKING!!!)
    # Make new column with brand and product descriptions
    df["comp_unit"] = np.where(
        (df["comp_new"].isnull().all() or df["comp_new"].isin(df[unit])),
        df[unit],
        (df["comp_new"] + " " + df[unit]),
    )
    # (!!!END OF NOT WORKING!!!)
    
    return df

df_new = my_func(df, "unit_desc", "company", bad_info_list=["Generic", "Name"])
print(df_new)  
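
For what it's worth, here is a minimal sketch (my addition, not part of the original attempts) of what triggers that ValueError: Python's or has to collapse each operand to a single True/False via bool(), and pandas refuses to guess for a whole Series. The element-wise | operator keeps the condition as a Series, which is what np.where needs.

import pandas as pd
import numpy as np

s = pd.Series([np.nan, "Some LLC"])

try:
    bool(s.isnull())            # this is what `or` does behind the scenes
except ValueError as e:
    print(e)                    # The truth value of a Series is ambiguous...

mask = s.isnull() | s.isin(["Some LLC"])   # element-wise OR stays a Series
print(mask.tolist())                       # [True, True]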

ATTEMPT 1: Uses np.where with ValueError suggestions as indicated with inline comments
ATTEMPT 1 Results:

  • Using .all(): Seems to consider all matches over entire Series, so produces wrong results
  • Using .any(): Seems to consider any matches over entire Series, so produces wrong results
  • Using .item(): Seems to check size of entire Series, so produces ValueError: can only convert an array of size 1 to a Python scalar
  • Using .bool(): Returns same ValueError as before
def my_func(df, unit, comp, bad_info_list):

    # (WORKS) Make new company column filtering out unwanted terms
    df["comp_new"] = df[comp].apply(lambda x: x if x not in bad_info_list else np.nan)
    # (!!!START OF NOT WORKING!!!)
    # Make new column with brand and product descriptions
    df["comp_unit"] = np.where(
        ((df["comp_new"].isnull().all()) | (df["comp_new"].isin(df[unit]))), # Swap .all() with other options
        df[unit],
        (df["comp_new"] + " " + df[unit]),
    )
    # (!!!END OF NOT WORKING!!!)
    
    return df


df_new = my_func(df, "unit_desc", "company", bad_info_list=["Generic", "Name"])
print(df_new)
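
A small sketch (mine, not part of the original post) of why .all() and .any() silence the error but break the logic: they reduce the whole comp_new column to one Python bool, so np.where sees a single scalar condition and applies the same branch to every row instead of deciding row by row.

import pandas as pd
import numpy as np

comp_new = pd.Series([np.nan, "Some LLC", "Another LLC"])

print(comp_new.isnull().all())   # False - a single scalar for the whole column
print(comp_new.isnull().any())   # True  - likewise one scalar
# np.where broadcasts that scalar, so the condition is no longer per-row:
print(np.where(comp_new.isnull().any(), "unit only", "company + unit"))   # 'unit only'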

ATTEMPT 1.5: Same as 1 except .isnull().all() is swapped with == np.nan
ATTEMPT 1.5 Results: Incorrect results
I found it odd that I got no ambiguity errors with the isin statement; perhaps it's not working as intended?
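
Two quick checks (my own, not from the post) that explain both oddities: comparing against np.nan is always False, and isin tests exact membership against the other column's set of values, not per-row substring containment.

import pandas as pd
import numpy as np

print(np.nan == np.nan)   # False - NaN never compares equal to anything, itself included

comp_new = pd.Series(["Some LLC", np.nan])
unit_desc = pd.Series(["Some LLC Xtra cool space", "Unit with features"])

# "Some LLC" is not an exact value in unit_desc, so isin says False,
# even though the same row's unit_desc contains it as a substring.
print(comp_new.isin(unit_desc).tolist())   # [False, False]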

ATTEMPT 2: Uses if/elif/else and different suggestions from ValueError
Seems fixable using for loops for each conditional, but shouldn't there be another way?
ATTEMPT 2 Results: see bullet points from ATTEMPT 1

def my_func(df, unit, comp, bad_info_list):
    # (WORKS) Make new company column filtering out unwanted terms
    df["comp_new"] = df[comp].apply(lambda x: x if x not in bad_info_list else np.nan)
    # (!!!START OF NOT WORKING!!!)
    if df["comp_new"].isnull(): # Tried .all(), .any(), .item(), etc. just before ":"
        df["comp_unit"] = df[unit]
    elif df["comp_new"].isin(df[unit]): # Tried .all(), etc. just before ":"
        df["comp_unit"] = df[unit]
    else:
        df["comp_unit"] = df["comp_new"] + " " + df[unit]
    # (!!!END OF NOT WORKING!!!)
    
    return df

df_new = my_func(df, "unit_desc", "company", bad_info_list=["Generic", "Name"])
print(df_new)

ATTEMPT 3: Uses if/elif/else combined with apply
ATTEMPT 3 Results: AttributeError: 'float' object has no attribute 'isin'

bad_info_list=["Generic", "Name"]
df["comp_new"] = df["company"].apply(lambda x: x if x not in bad_info_list else np.nan)

def comp_unit_merge(df):
    if df["comp_new"] == np.nan: #.isnull().item():
        return df["unit_desc"]
    elif df["comp_new"].isin(df["unit_desc"]): # AttributeError: 'float' object has no attribute 'isin'
        return df["unit_desc"]
    else:
        return df["comp_new"] + " " + df["unit_desc"]
    
df["comp_unit"] = df.apply(comp_unit_merge, axis=1)
print(df)

ATTEMPT 4: Uses np.select(conditions, values)
ATTEMPT 4 Result: Incorrect results
Company name not included in last few rows

def my_func(df, unit, comp, bad_info_list):
    # (WORKS) Make new company column filtering out unwanted terms
    df["comp_new"] = df[comp].apply(lambda x: x if x not in bad_info_list else np.nan)
    # (!!!START OF NOT WORKING!!!)
    conditions = [
        ((df["comp_new"] == np.nan) | (df["comp_new"].isin(df[comp]))),
        (df["comp_new"] != np.nan),
    ]
    values = [
        (df[unit]),
        (df["comp_new"] + " " + df[unit]),
    ]
    df["comp_unit"] = np.select(conditions, values)
    # (!!!END OF NOT WORKING!!!)
    
    return df

df_new = my_func(df, "unit_desc", "company", bad_info_list=["Generic", "Name"])
print(df_new)
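
A short check of my reading of why the company name disappears here (this diagnosis is my addition, not from the original post): comp_new == np.nan is always False, as shown earlier, and comp_new.isin(df[comp]) tests membership against the company column itself, so it is True for every row that still has a company name, and np.select takes the unit-only branch for exactly those rows.

import pandas as pd

comp_new = pd.Series(["Some LLC", "Another LLC"])
company = pd.Series(["Some LLC", "Another LLC", "Generic"])

# Every surviving comp_new value came straight from the company column,
# so this membership test is True for all of them.
print(comp_new.isin(company).tolist())   # [True, True]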

Solution

  • When using axis=1, the applied function receives a single row as an argument. Indexing into the row gives you string objects, in most cases -- except where a NaN is encountered.

    Numpy NaNs are actually floats. So when you attempt to perform string operations on the company column, like checking whether the unit_desc contains the company, this throws errors for rows that contain NaNs.

    Numpy has a function isnan, but calling this function on a string also throws an error. So any rows that have an actual company value will cause problems for that check.
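
    A tiny illustration of both failure modes, using throwaway values (this snippet is my addition):

    import numpy as np

    try:
        "LLC" in np.nan              # substring check against a missing company
    except TypeError as e:
        print(e)                     # argument of type 'float' is not iterable

    try:
        np.isnan("Some LLC")         # NaN check against an actual company name
    except TypeError as e:
        print(e)                     # np.isnan does not accept strings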

    You could check the type of the data with isinstance, or you could just remove the NaNs from your data ahead of time.


    This example removes the NaNs ahead of time.

    badlist=["Generic", "Name"]
    
    def merge(row):
        if row['company'] in badlist:
            return row['unit_desc']
        if row['company'] in row['unit_desc']:
            return row['unit_desc']
        return f"{row['company']} {row['unit_desc']}".strip()
    
    df['company'] = df['company'].fillna('')
    df['comp_unit'] = df.apply(merge, axis=1)
    print(df)
    



    Here's an alternative that safely detects the NaNs:

    badlist=["Generic", "Name"]
    
    def merge(row):
        if isinstance(row['company'], float) and np.isnan(row['company']):
            return row['unit_desc']
        if row['company'] in badlist:
            return row['unit_desc']
        if row['company'] in row['unit_desc']:
            return row['unit_desc']
        return f"{row['company']} {row['unit_desc']}".strip()
    
    df['comp_unit'] = df.apply(merge, axis=1)
    print(df)
    

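    If you would rather stay fully vectorized (closer to the np.where spirit of the original attempts), here is one possible sketch. It is an alternative I am adding rather than part of the approach above, it reuses the df defined at the top of the question, and the per-row "does unit_desc already contain the company" test still needs a Python-level comprehension:

    import pandas as pd
    import numpy as np

    badlist = ["Generic", "Name"]

    comp = df["company"].fillna("")              # treat missing companies as empty strings
    is_bad = comp.isin(badlist)                  # unwanted company names
    # per-row substring test: does unit_desc already contain the company?
    contains = pd.Series(
        [c != "" and c in u for c, u in zip(comp, df["unit_desc"])],
        index=df.index,
    )

    df["comp_unit"] = np.where(
        is_bad | contains | (comp == ""),        # keep unit_desc alone for these rows
        df["unit_desc"],
        (comp + " " + df["unit_desc"]).str.strip(),
    )
    print(df)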