Tags: python-3.x, pandas, dataframe, switch-statement, pandas-apply

Issues Converting Python Code Block to Function


There is a block of code I use regularly in my analysis to standardize the description of the types of devices used by customers to access an internet provider's services. The block of code is as follows:

# Standardize devices_desc labels
###-- SMARTPHONE
devices_df["devices_desc"] = devices_df["devices_desc"].replace(
    ["SMART PHONE", "SMARTPHONE"], "SMARTPHONE"
)
###-- FEATURE_PHONE
devices_df["devices_desc"] = devices_df["devices_desc"].replace(
    ["FEATURE PHONE"], "FEATURE_PHONE"
)
###-- BASIC_PHONE
devices_df["devices_desc"] = devices_df["devices_desc"].replace(
    ["BASIC PHONE", "BASIC"], "BASIC_PHONE"
)
###-- TABLET
devices_df["devices_desc"] = devices_df["devices_desc"].replace(
    ["TABLETS", "TABLET"], "TABLET"
)
###-- MODEM/GSM_GATEWAY
devices_df["devices_desc"] = devices_df["devices_desc"].replace(
    [
        "MODEM/GSM GATEWAY",
        "DONGLE",
        "PLUGGABLE CARD (E.G. USB STICK)",
    ],
    "MODEM/GSM_GATEWAY",
)
###-- M2M_EQUIPMENT
devices_df["devices_desc"] = devices_df["devices_desc"].replace(
    ["M2M EQUIPMENT"], "M2M_EQUIPMENT"
)
###-- NA/UNKNOWN
devices_df["devices_desc"] = devices_df["devices_desc"].replace(
    [np.nan, "UNKNOWN", "NA", "OTHER", "-"], "UNDEFINED"
)

devices_df is a DataFrame and devices_desc is one of its columns. I use pandas (Anaconda distribution) for my analysis. I have decided to convert this code block into a function so that it is reusable across all the files I use for the analysis. Below is my initial attempt:

def fix_cust_device_type(devices_desc):
    if devices_desc in ["BASIC PHONE", "BASIC"]:
        return "BASIC_PHONE"
    if devices_desc in ["FEATURE PHONE"]:
        return "FEATURE_PHONE"
    if devices_desc in ["SMART PHONE", "SMARTPHONE"]:
        return "SMARTPHONE"
    if devices_desc in ["TABLETS", "TABLET"]:
        return "TABLET"
    if devices_desc in [
        "MODEM/GSM GATEWAY",
        "DONGLE",
        "PLUGGABLE CARD (E.G. USB STICK)",
    ]:
        return "MODEM/GSM_GATEWAY"
    if devices_desc in ["M2M EQUIPMENT"]:
        return "M2M_EQUIPMENT"
    else:
        return "UNDEFINED"

I tried to apply the function as follows:

devices_df["devices_desc"] = devices_df["devices_desc"].apply(fix_cust_device_type)

However, I get the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-28-abd75c9eeb58> in <module>
----> 1 devices_df["devices_desc"] = GSM_Data["devices_desc"].apply(fix_cust_device_type)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   4198             else:
   4199                 values = self.astype(object)._values
-> 4200                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   4201 
   4202         if len(mapped) and isinstance(mapped[0], Series):

pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()

<ipython-input-23-bad83bc1b381> in fix_cust_device_type(devices_desc)
      1 def fix_cust_device_type(devices_desc):
----> 2     if devices_desc in ["BASIC PHONE", "BASIC"]:
      3         return "BASIC_PHONE"
      4 
      5     if devices_desc in ["FEATURE PHONE"]:

pandas\_libs\missing.pyx in pandas._libs.missing.NAType.__bool__()

TypeError: boolean value of NA is ambiguous

My efforts to find the cause of the error have been unsuccessful. I would like to understand the following:

  1. The causes of the error
  2. How to rectify the error
  3. The pythonic way of implementing my proposed solution

Kindly assist. Thank you.


Solution

  • It is hard to say for certain without sample data (or a minimal working example), but the error message suggests that your DataFrame contains missing values (pd.NA).

    When I run your code on a simple example, everything works fine, e.g.:

    df = pd.DataFrame({"devices_desc": ["BASIC", "DONGLE"]})
    df["devices_desc"].apply(fix_cust_device_type)
    
    # Out:
    # 0          BASIC_PHONE
    # 1    MODEM/GSM_GATEWAY
    

    But when I include missing data I get the error you've posted:

    df = pd.DataFrame({"devices_desc": ["BASIC", pd.NA]})
    df["devices_desc"].apply(fix_cust_device_type)
    
    # --> TypeError: boolean value of NA is ambiguous
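
    The reason the membership test fails can be shown in isolation. The sketch below assumes only that pandas is installed: `x in some_list` compares x with each element and needs a plain True/False from each comparison, but comparing pd.NA with a string returns pd.NA, and converting that to a boolean raises. A float NaN, by contrast, simply compares as unequal, which is why the same function does not fail on np.nan.

```python
import pandas as pd

# Comparing pd.NA with a string yields pd.NA rather than True or False.
result = pd.NA == "BASIC"
print(result is pd.NA)

# The `in` operator has to turn each comparison into a bool,
# and bool(pd.NA) raises TypeError.
try:
    pd.NA in ["BASIC PHONE", "BASIC"]
    raised = False
except TypeError as exc:
    raised = True
    print(exc)

# A float NaN compares as "not equal", so no error is raised here.
nan_found = float("nan") in ["BASIC PHONE", "BASIC"]
print(nan_found)
```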
    

    Therefore you should check your data. If NA values are acceptable, handle them explicitly in fix_cust_device_type, e.g. by adding the following at the beginning of the function:

    if pd.isna(devices_desc):
        return "NA"  # or any string that suits your needs
    

    If NA values are not acceptable, drop them, e.g. with df.dropna() or df.dropna(subset=["devices_desc"]).
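
    Put together, the guarded version of the function might look like the sketch below. Returning "UNDEFINED" for missing values is an assumption made here to match the NA/UNKNOWN rule in the question's original replace block; any other label would work the same way.

```python
import pandas as pd

def fix_cust_device_type(devices_desc):
    # Guard first: a membership test against pd.NA would raise TypeError.
    # "UNDEFINED" is an assumed label, taken from the question's replace block.
    if pd.isna(devices_desc):
        return "UNDEFINED"
    if devices_desc in ["BASIC PHONE", "BASIC"]:
        return "BASIC_PHONE"
    if devices_desc in ["FEATURE PHONE"]:
        return "FEATURE_PHONE"
    if devices_desc in ["SMART PHONE", "SMARTPHONE"]:
        return "SMARTPHONE"
    if devices_desc in ["TABLETS", "TABLET"]:
        return "TABLET"
    if devices_desc in ["MODEM/GSM GATEWAY", "DONGLE", "PLUGGABLE CARD (E.G. USB STICK)"]:
        return "MODEM/GSM_GATEWAY"
    if devices_desc in ["M2M EQUIPMENT"]:
        return "M2M_EQUIPMENT"
    # Anything unrecognized falls through to the catch-all label.
    return "UNDEFINED"

df = pd.DataFrame({"devices_desc": ["BASIC", pd.NA, "DONGLE", "-"]})
df["devices_desc"] = df["devices_desc"].apply(fix_cust_device_type)
print(df["devices_desc"].tolist())
```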

    Another way of dealing with your issue would be the following:

    1. Convert your function to a dictionary
    # This is a short version just for showcase
    replace_dict = {'BASIC': 'BASIC_PHONE', 'DONGLE': 'MODEM/GSM_GATEWAY'}
    
    2. Use the replace method with the created dict (no need for apply, and it works with missing values, which are left as <NA>)
    df = pd.DataFrame({"devices_desc": ["BASIC", "DONGLE", pd.NA]})
    df["devices_desc"] = df["devices_desc"].replace(replace_dict)
    
    # Content of df:
    #         devices_desc
    # 0        BASIC_PHONE
    # 1  MODEM/GSM_GATEWAY
    # 2               <NA>
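
    For reuse across files, the dictionary approach can be wrapped in a small helper. This is only a sketch: the names DEVICE_LABELS and standardize_device_labels are hypothetical, the mapping is copied from the replace calls in the question, and filling missing values with "UNDEFINED" is an assumption based on the question's NA/UNKNOWN rule.

```python
import pandas as pd

# Raw label -> standardized label, copied from the question's replace() calls.
# DEVICE_LABELS is a hypothetical name used for illustration.
DEVICE_LABELS = {
    "SMART PHONE": "SMARTPHONE",
    "FEATURE PHONE": "FEATURE_PHONE",
    "BASIC PHONE": "BASIC_PHONE",
    "BASIC": "BASIC_PHONE",
    "TABLETS": "TABLET",
    "MODEM/GSM GATEWAY": "MODEM/GSM_GATEWAY",
    "DONGLE": "MODEM/GSM_GATEWAY",
    "PLUGGABLE CARD (E.G. USB STICK)": "MODEM/GSM_GATEWAY",
    "M2M EQUIPMENT": "M2M_EQUIPMENT",
    "UNKNOWN": "UNDEFINED",
    "NA": "UNDEFINED",
    "OTHER": "UNDEFINED",
    "-": "UNDEFINED",
}

def standardize_device_labels(df, column="devices_desc"):
    """Return a copy of df with the labels in `column` standardized."""
    out = df.copy()
    # Fill missing values first (assumed label), then map the known variants.
    out[column] = out[column].fillna("UNDEFINED").replace(DEVICE_LABELS)
    return out

raw = pd.DataFrame({"devices_desc": ["BASIC", "DONGLE", pd.NA, "SMARTPHONE", "-"]})
clean = standardize_device_labels(raw)
print(clean["devices_desc"].tolist())
```

    One difference from the apply-based function is worth keeping in mind: replace leaves any label that is not in the dictionary unchanged, whereas the function in the question maps every unrecognized label to "UNDEFINED".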