First time asking. Is there a way to get a new df column combining all three statements (or, isnull-like, isin-like) without iterating over a for loop, i.e. keeping the code within the spirit of Pandas? I've tried advice from several threads dealing with individual aspects of common conditional problems, but every variation I've tried either leads me to "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()." or produces incorrect results. Below are example data and code from several attempts.

My goal is to produce a dataframe (printed below) where the last (new) column concatenates all words from the company and unit columns, (1) without any NaNs (note: 'unit_desc' contains no null values irl, so NaNs in 'comp_unit' mean my function isn't working properly) and (2) without repeating the company name (because sometimes 'unit_desc' already [incorrectly] contains the company name, as in example row 2).
Desired dataframe
| company | unit_desc | comp_new | comp_unit |
|---|---|---|---|
| Generic | Some description | NaN | Some description |
| NaN | Unit with features | NaN | Unit with features |
| Some LLC | Some LLC Xtra cool space | Some LLC | Some LLC Xtra cool space |
| Another LLC | Unit with features | Another LLC | Another LLC Unit with features |
| Another LLC | Basic unit | Another LLC | Another LLC Basic unit |
| Some LLC | basic unit | Some LLC | Some LLC basic unit |
Imports and initial example df
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'company': ['Generic', np.nan, 'Some LLC', 'Another LLC', 'Another LLC', 'Some LLC'],
    'unit_desc': ['Some description', 'Unit with features', 'Some LLC Xtra cool space', 'Unit with features', 'Basic unit', 'basic unit'],
})
```
ATTEMPT 0: Uses np.where
ATTEMPT 0 Results: ValueError as above
```python
def my_func(df, unit, comp, bad_info_list):
    """Return new dataframe with new column combining company and unit descriptions

    Args:
        df (DataFrame): Pandas dataframe with product and brand info
        unit (str): name of unit description column
        comp (str): name of company name column
        bad_info_list (list): list of unwanted terms
    """
    # (WORKS) Make new company column filtering out unwanted terms
    df["comp_new"] = df[comp].apply(lambda x: x if x not in bad_info_list else np.nan)

    # (!!!START OF NOT WORKING!!!)
    # Make new column with brand and product descriptions
    df["comp_unit"] = np.where(
        (df["comp_new"].isnull().all() or df["comp_new"].isin(df[unit])),
        df[unit],
        (df["comp_new"] + " " + df[unit]),
    )
    # (!!!END OF NOT WORKING!!!)
    return df

df_new = my_func(df, "unit_desc", "company", bad_info_list=["Generic", "Name"])
print(df_new)
```
ATTEMPT 1: Uses np.where with the ValueError's suggested methods, swapped in as indicated by the inline comment
ATTEMPT 1 Results: incorrect results
```python
def my_func(df, unit, comp, bad_info_list):
    # (WORKS) Make new company column filtering out unwanted terms
    df["comp_new"] = df[comp].apply(lambda x: x if x not in bad_info_list else np.nan)

    # (!!!START OF NOT WORKING!!!)
    # Make new column with brand and product descriptions
    df["comp_unit"] = np.where(
        ((df["comp_new"].isnull().all()) | (df["comp_new"].isin(df[unit]))),  # Swap .all() with other options
        df[unit],
        (df["comp_new"] + " " + df[unit]),
    )
    # (!!!END OF NOT WORKING!!!)
    return df

df_new = my_func(df, "unit_desc", "company", bad_info_list=["Generic", "Name"])
print(df_new)
```
ATTEMPT 1.5: Same as 1, except `.isnull().all()` is swapped with `== np.nan`
ATTEMPT 1.5 Results: incorrect results
I found it odd that I got no ambiguity errors with the `isin` statement; perhaps it's not working as intended?
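A likely reason `isin` runs without error but never matches: `Series.isin` tests whether each element is *equal to* some value in the passed collection; it is not a substring test. A minimal illustration, borrowing values from the example df:

```python
import pandas as pd

companies = pd.Series(["Some LLC", "Another LLC"])
units = pd.Series(["Some LLC Xtra cool space", "Basic unit"])

# isin checks element-wise EQUALITY against the other collection,
# not substring containment:
print(companies.isin(units).tolist())                   # [False, False]

# An exact match, by contrast, is found:
print(pd.Series(["Basic unit"]).isin(units).tolist())   # [True]
```

So "Some LLC" is never flagged even though it appears inside "Some LLC Xtra cool space"; a containment check needs something like `in` per row or `str.contains`.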
ATTEMPT 2: Uses if/elif/else and different suggestions from ValueError
Seems fixable using for loops for each conditional, but shouldn't there be another way?
ATTEMPT 2 Results: same as ATTEMPT 1
```python
def my_func(df, unit, comp, bad_info_list):
    # (WORKS) Make new company column filtering out unwanted terms
    df["comp_new"] = df[comp].apply(lambda x: x if x not in bad_info_list else np.nan)

    # (!!!START OF NOT WORKING!!!)
    if df["comp_new"].isnull():  # Tried .all(), .any(), .item(), etc. just before ":"
        df["comp_unit"] = df[unit]
    elif df["comp_new"].isin(df[unit]):  # Tried .all(), etc. just before ":"
        df["comp_unit"] = df[unit]
    else:
        df["comp_unit"] = df["comp_new"] + " " + df[unit]
    # (!!!END OF NOT WORKING!!!)
    return df

df_new = my_func(df, "unit_desc", "company", bad_info_list=["Generic", "Name"])
print(df_new)
```
ATTEMPT 3: Uses if/elif/else combined with apply
ATTEMPT 3 Results: AttributeError: 'float' object has no attribute 'isin'
```python
bad_info_list = ["Generic", "Name"]
df["comp_new"] = df["company"].apply(lambda x: x if x not in bad_info_list else np.nan)

def comp_unit_merge(df):
    if df["comp_new"] == np.nan:  # .isnull().item():
        return df["unit_desc"]
    elif df["comp_new"].isin(df["unit_desc"]):  # AttributeError: 'float' object has no attribute 'isin'
        return df["unit_desc"]
    else:
        return df["comp_new"] + " " + df["unit_desc"]

df["comp_unit"] = df.apply(comp_unit_merge, axis=1)
print(df)
```
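An aside on the first branch above: NaN never compares equal to anything, itself included, so an `== np.nan` test is always False and the null check silently never fires. `pd.isna` is the reliable test:

```python
import numpy as np
import pandas as pd

# NaN is not equal to anything, including itself:
print(np.nan == np.nan)      # False

# So use pd.isna to detect missing values instead:
print(pd.isna(np.nan))       # True
print(pd.isna("Some LLC"))   # False (also safe to call on strings)
```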
ATTEMPT 4: Uses np.select(conditions, values)
ATTEMPT 4 Results: incorrect results; the company name is not included in the last few rows
```python
def my_func(df, unit, comp, bad_info_list):
    # (WORKS) Make new company column filtering out unwanted terms
    df["comp_new"] = df[comp].apply(lambda x: x if x not in bad_info_list else np.nan)

    # (!!!START OF NOT WORKING!!!)
    conditions = [
        ((df["comp_new"] == np.nan) | (df["comp_new"].isin(df[comp]))),
        (df["comp_new"] != np.nan),
    ]
    values = [
        (df[unit]),
        (df["comp_new"] + " " + df[unit]),
    ]
    df["comp_unit"] = np.select(conditions, values)
    # (!!!END OF NOT WORKING!!!)
    return df

df_new = my_func(df, "unit_desc", "company", bad_info_list=["Generic", "Name"])
print(df_new)
```
When using axis=1, the applied function receives a single row as its argument. Indexing into the row gives you string objects in most cases, except where a NaN is encountered.

NumPy NaNs are actually floats, so when you attempt string operations on the company column, like checking whether unit_desc contains the company, rows that contain NaNs throw errors.
NumPy has an `isnan` function, but calling it on a string also throws an error, so any row that has an actual company value will break that check.
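Both failure modes are easy to reproduce in isolation:

```python
import numpy as np

# A NaN taken out of a DataFrame is a plain float:
print(type(np.nan))  # <class 'float'>

# String operations on it raise TypeError:
try:
    "Some LLC " + np.nan
except TypeError:
    print("cannot concatenate str and float")

# ...and np.isnan refuses string input:
try:
    np.isnan("Some LLC")
except TypeError:
    print("np.isnan does not accept strings")
```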
You could check the type of the data with `isinstance`, or you could just remove the NaNs from your data ahead of time.
This example removes the NaNs ahead of time.
```python
badlist = ["Generic", "Name"]

def merge(row):
    if row['company'] in badlist:
        return row['unit_desc']
    if row['company'] in row['unit_desc']:
        return row['unit_desc']
    return f"{row['company']} {row['unit_desc']}".strip()

df['company'] = df['company'].fillna('')
df['comp_unit'] = df.apply(merge, axis=1)
print(df)
```
Here's an alternative that safely detects the NaNs:
```python
badlist = ["Generic", "Name"]

def merge(row):
    if isinstance(row['company'], float) and np.isnan(row['company']):
        return row['unit_desc']
    if row['company'] in badlist:
        return row['unit_desc']
    if row['company'] in row['unit_desc']:
        return row['unit_desc']
    return f"{row['company']} {row['unit_desc']}".strip()

df['comp_unit'] = df.apply(merge, axis=1)
print(df)
```
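If you'd rather avoid `apply` entirely, the same logic can be expressed mostly vectorized with `np.where`; the row-wise substring check has no fully vectorized form here, so it keeps one Python-level pass. A sketch, reusing the example df and bad-terms list from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'company': ['Generic', np.nan, 'Some LLC', 'Another LLC', 'Another LLC', 'Some LLC'],
    'unit_desc': ['Some description', 'Unit with features', 'Some LLC Xtra cool space', 'Unit with features', 'Basic unit', 'basic unit'],
})
bad_info_list = ["Generic", "Name"]

# NaN out the unwanted company names (mirrors the comp_new column)
comp = df['company'].where(~df['company'].isin(bad_info_list))

# One Python-level pass for the substring containment test
contained = np.array([isinstance(c, str) and c in u
                      for c, u in zip(comp, df['unit_desc'])])

df['comp_unit'] = np.where(
    comp.isna() | contained,                   # missing, or already present
    df['unit_desc'],                           # -> keep the unit description
    comp.fillna('') + ' ' + df['unit_desc'],   # -> prepend company name
)
print(df['comp_unit'].tolist())
```

This reproduces the desired `comp_unit` column from the question's table, including the rows where the company name was already embedded in `unit_desc`.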