There is a block of code I use regularly in my analysis to standardize the description of the types of devices used by customers to access an internet provider's services. The block of code is as follows:
# Standardize devices_desc labels
###-- SMARTPHONE
devices_df["devices_desc"] = devices_df["devices_desc"].replace(
["SMART PHONE", "SMARTPHONE"], "SMARTPHONE"
)
###-- FEATURE_PHONE
devices_df["devices_desc"] = devices_df["devices_desc"].replace(
["FEATURE PHONE"], "FEATURE_PHONE"
)
###-- BASIC_PHONE
devices_df["devices_desc"] = devices_df["devices_desc"].replace(
["BASIC PHONE", "BASIC"], "BASIC_PHONE"
)
###-- TABLET
devices_df["devices_desc"] = devices_df["devices_desc"].replace(
["TABLETS", "TABLET"], "TABLET"
)
###-- MODEM/GSM_GATEWAY
devices_df["devices_desc"] = devices_df["devices_desc"].replace(
[
"MODEM/GSM GATEWAY",
"DONGLE",
"PLUGGABLE CARD (E.G. USB STICK)",
"MODEM/GSM GATEWAY",
],
"MODEM/GSM_GATEWAY",
)
###-- M2M_EQUIPMENT
devices_df["devices_desc"] = devices_df["devices_desc"].replace(
["M2M EQUIPMENT"], "M2M_EQUIPMENT"
)
###-- NA/UNKNOWN
devices_df["devices_desc"] = devices_df["devices_desc"].replace(
[np.nan, "UNKNOWN", "NA", "OTHER", "-"], "UNDEFINED"
)
devices_df is a dataframe and devices_desc is a column in devices_df. I use pandas (Anaconda distribution) for my analysis. I have decided to convert this code block into a function to make it reusable across all the files I use for the analysis. Below is my initial attempt:
def fix_cust_device_type(devices_desc):
    if devices_desc in ["BASIC PHONE", "BASIC"]:
        return "BASIC_PHONE"
    if devices_desc in ["FEATURE PHONE"]:
        return "FEATURE_PHONE"
    if devices_desc in ["SMART PHONE", "SMARTPHONE"]:
        return "SMARTPHONE"
    if devices_desc in ["TABLETS", "TABLET"]:
        return "TABLET"
    if devices_desc in [
        "MODEM/GSM GATEWAY",
        "DONGLE",
        "PLUGGABLE CARD (E.G. USB STICK)",
        "MODEM/GSM GATEWAY",
    ]:
        return "MODEM/GSM_GATEWAY"
    if devices_desc in ["M2M EQUIPMENT"]:
        return "M2M_EQUIPMENT"
    else:
        return "UNDEFINED"
I tried to apply the function as follows:
devices_df["devices_desc"] = devices_df["devices_desc"].apply(fix_cust_device_type)
However, I get the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-28-abd75c9eeb58> in <module>
----> 1 devices_df["devices_desc"] = GSM_Data["devices_desc"].apply(fix_cust_device_type)
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
4198 else:
4199 values = self.astype(object)._values
-> 4200 mapped = lib.map_infer(values, f, convert=convert_dtype)
4201
4202 if len(mapped) and isinstance(mapped[0], Series):
pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()
<ipython-input-23-bad83bc1b381> in fix_cust_device_type(devices_desc)
1 def fix_cust_device_type(devices_desc):
----> 2 if devices_desc in ["BASIC PHONE", "BASIC"]:
3 return "BASIC_PHONE"
4
5 if devices_desc in ["FEATURE PHONE"]:
pandas\_libs\missing.pyx in pandas._libs.missing.NAType.__bool__()
**TypeError: boolean value of NA is ambiguous**
My efforts to find the cause of the error have been unsuccessful. I would like to understand why it occurs and how to fix it.
Kindly assist. Thank you.
It's hard to tell since you didn't provide any data (or an MWE), but from the error message it seems there is missing data (pd.NA) in your dataframe.
When I try to run your code with a simple example, everything works fine, e.g.:
df = pd.DataFrame({"devices_desc": ["BASIC", "DONGLE"]})
df["devices_desc"].apply(fix_cust_device_type)
# Out:
# 0 BASIC_PHONE
# 1 MODEM/GSM_GATEWAY
But when I include missing data I get the error you've posted:
df = pd.DataFrame({"devices_desc": ["BASIC", pd.NA]})
df["devices_desc"].apply(fix_cust_device_type)
# --> TypeError: boolean value of NA is ambiguous
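To see why a missing value trips the `in` test, here is a minimal sketch (assuming a pandas version that provides pd.NA, i.e. 1.0 or later). Comparing pd.NA with a string gives pd.NA rather than True or False, and the list membership test then has to turn that result into a boolean, which pandas refuses to do:

```python
import pandas as pd

# Comparing pd.NA with a string does not yield True or False; it yields pd.NA.
result = pd.NA == "BASIC"
print(result is pd.NA)

# `in` compares the value with each list element and needs a real boolean
# from every comparison, so it ends up calling bool(pd.NA), which raises.
try:
    pd.NA in ["BASIC PHONE", "BASIC"]
    raised = False
except TypeError as exc:
    raised = True
    message = str(exc)
    print(message)
```

This is the same TypeError shown in the traceback, raised on the first `if` of the function.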
Therefore you should check your data. If NA values are OK, then you should handle them in fix_cust_device_type, e.g. by adding the following code at the beginning of the function:
if pd.isna(devices_desc):
    return "NA"  # or any string according to your needs
If NA values are not OK, you should drop them, e.g. with df.dropna() or df.dropna(subset=["devices_desc"]).
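Putting the guard into the function from the question could look like the sketch below. Returning "UNDEFINED" for missing values is an assumption on my part, chosen because the original replace block maps NaN to that label; the small dataframe is made up for illustration.

```python
import pandas as pd


def fix_cust_device_type(devices_desc):
    # Handle missing values first, before any `in` test can touch pd.NA.
    # "UNDEFINED" is an assumed label here; use whatever fits your data.
    if pd.isna(devices_desc):
        return "UNDEFINED"
    if devices_desc in ["BASIC PHONE", "BASIC"]:
        return "BASIC_PHONE"
    if devices_desc in ["FEATURE PHONE"]:
        return "FEATURE_PHONE"
    if devices_desc in ["SMART PHONE", "SMARTPHONE"]:
        return "SMARTPHONE"
    if devices_desc in ["TABLETS", "TABLET"]:
        return "TABLET"
    if devices_desc in [
        "MODEM/GSM GATEWAY",
        "DONGLE",
        "PLUGGABLE CARD (E.G. USB STICK)",
    ]:
        return "MODEM/GSM_GATEWAY"
    if devices_desc in ["M2M EQUIPMENT"]:
        return "M2M_EQUIPMENT"
    # Anything not recognized above falls through to the catch-all label.
    return "UNDEFINED"


# Illustrative data only, including one missing value.
df = pd.DataFrame({"devices_desc": ["BASIC", "DONGLE", pd.NA, "TABLETS"]})
df["devices_desc"] = df["devices_desc"].apply(fix_cust_device_type)
print(df["devices_desc"].tolist())
```

Note that, unlike the original replace block, this function sends every unrecognized label to "UNDEFINED", including already standardized labels such as "BASIC_PHONE", so it should not be applied twice to the same column.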
Another way of dealing with your issue would be the following:
# This is a shortened mapping, just for demonstration
replace_dict = {'BASIC': 'BASIC_PHONE', 'DONGLE': 'MODEM/GSM_GATEWAY'}
df = pd.DataFrame({"devices_desc": ["BASIC", "DONGLE", pd.NA]})
df["devices_desc"] = df["devices_desc"].replace(replace_dict)
# Content of df:
# devices_desc
# 0 BASIC_PHONE
# 1 MODEM/GSM_GATEWAY
# 2 <NA>
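If you want the dictionary approach as a reusable function, one possible sketch is below. The names DEVICE_LABELS, REPLACE_DICT and standardize_device_desc are hypothetical, the groupings are copied from the replace block in the question, and filling missing values with "UNDEFINED" is an assumption that mirrors that block.

```python
import pandas as pd

# Standard label -> raw spellings that should be folded into it
# (groupings taken from the question; adjust to your data).
DEVICE_LABELS = {
    "SMARTPHONE": ["SMART PHONE"],
    "FEATURE_PHONE": ["FEATURE PHONE"],
    "BASIC_PHONE": ["BASIC PHONE", "BASIC"],
    "TABLET": ["TABLETS"],
    "MODEM/GSM_GATEWAY": [
        "MODEM/GSM GATEWAY",
        "DONGLE",
        "PLUGGABLE CARD (E.G. USB STICK)",
    ],
    "M2M_EQUIPMENT": ["M2M EQUIPMENT"],
    "UNDEFINED": ["UNKNOWN", "NA", "OTHER", "-"],
}

# Invert to a flat raw -> standard lookup usable with Series.replace.
REPLACE_DICT = {
    raw: label for label, raws in DEVICE_LABELS.items() for raw in raws
}


def standardize_device_desc(series):
    """Return a new Series with standardized labels; missing values become UNDEFINED."""
    return series.replace(REPLACE_DICT).fillna("UNDEFINED")


# Illustrative data only.
df = pd.DataFrame({"devices_desc": ["BASIC", "DONGLE", pd.NA, "SMARTPHONE", "-"]})
df["devices_desc"] = standardize_device_desc(df["devices_desc"])
print(df["devices_desc"].tolist())
```

Because replace leaves values it does not know untouched, labels that are already standardized pass through unchanged, so running the function more than once on the same column is harmless.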