python nan missing-data dummy-variable

How to create a Dummy Variable in Python if Missing Values are included?


How can I create a dummy variable when the data contains missing values? I have the following data and want to create a dummy variable based on several conditions. My problem is that the missing values are automatically converted to 0, but I want to keep them as missing values.

import pandas as pd
import numpy as np

mydata = {'x' : [10, 50, np.nan, 32, 47, np.nan, 20, 5, 100, 62], 
          'y' : [10, 1, 5,  np.nan, 47, np.nan, 8, 5, 100, 3]}
df = pd.DataFrame(mydata)

df["z"] = ((df["x"] >= 50) & (df["y"] <= 20)).astype(int)

print(df)

Solution

  • When you build the boolean mask, you compare NaN values with numbers. Any ordered comparison with NaN evaluates to False, so for rows where df['x'] is np.nan the mask df['x'] >= 50 is False, which becomes 0 when converted to an integer. To keep the missing values, create a second boolean mask that is True for every row containing np.nan in the columns ['x', 'y'], and then assign np.nan to z in those rows.

    Code:

    import pandas as pd
    import numpy as np
    
    mydata = {'x' : [10, 50, np.nan, 32, 47, np.nan, 20, 5, 100, 62], 
              'y' : [10, 1, 5,  np.nan, 47, np.nan, 8, 5, 100, 3]}
    df = pd.DataFrame(mydata)
    
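    # Cast to float so the column can hold NaN without a later dtype change.
    df["z"] = ((df["x"] >= 50) & (df["y"] <= 20)).astype(float)
    # Restore the missing values wherever x or y is NaN.
    df.loc[df[["x", "y"]].isna().any(axis=1), "z"] = np.nan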
    

    Output:

        x       y       z
    0   10.0    10.0    0.0
    1   50.0    1.0     1.0
    2   NaN     5.0     NaN
    3   32.0    NaN     NaN
    4   47.0    47.0    0.0
    5   NaN     NaN     NaN
    6   20.0    8.0     0.0
    7   5.0     5.0     0.0
    8   100.0   100.0   0.0
    9   62.0    3.0     1.0
    

    Alternatively, if you want a one-liner, you could use nested np.where calls:

    # Check only x and y for NaN so that other columns cannot affect the result.
    df["z"] = np.where(
        df[["x", "y"]].isna().any(axis=1), np.nan, np.where((df["x"] >= 50) & (df["y"] <= 20), 1, 0)
    )
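
If you would rather keep z as an integer column instead of floats, one possible variation is pandas' nullable "Int64" dtype combined with Series.mask. This is only a sketch under the same assumptions as above (columns x and y, thresholds 50 and 20); the helper name make_dummy is hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [10, 50, np.nan, 32, 47, np.nan, 20, 5, 100, 62],
    "y": [10, 1, 5, np.nan, 47, np.nan, 8, 5, 100, 3],
})

# An ordered comparison against NaN evaluates to False, which is why the
# plain mask silently turns missing rows into 0.
assert not (np.nan >= 50)


def make_dummy(frame):
    """Hypothetical helper: 1/0 dummy that is <NA> where x or y is missing."""
    cond = (frame["x"] >= 50) & (frame["y"] <= 20)
    missing = frame[["x", "y"]].isna().any(axis=1)
    # "Int64" (capital I) is the nullable integer dtype; mask() replaces the
    # flagged rows with <NA>, so the column stays integer-typed.
    return cond.astype("Int64").mask(missing)


df["z"] = make_dummy(df)
print(df)
```

The result holds the same 0/1 values as the float version, but missing rows show up as <NA> and the column dtype is Int64 rather than float64.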