Search code examples
pythonpandasnanone-hot-encoding

Transfering NaN's to Dummy Variables While Using One Hot Encoder


I am using OneHotEncoder to create a series of dummy variables based on a categoric variable. The problem I encounter is that any missing values are not transfered to the available dummy variables.

oh = OneHotEncoder(min_frequency = 0.0001, sparse_output = False) ### df_experience_level
data = oh.fit_transform(df[['experience_level']])
cols=oh.get_feature_names_out()
df_experience_level = pd.DataFrame(data,columns=cols)
missing_values_count = df_experience_level.isnull().sum()
missing_values_count

output:

experience_level_EN     0
experience_level_EX     0
experience_level_MI     0
experience_level_SE     0
experience_level_nan    0
dtype: int64

The current code that I use is:

df.loc[df['experience_level'].isna(), 'experience_level_EN'] = np.nan
df.loc[df['experience_level'].isna(), 'experience_level_EX'] = np.nan
df.loc[df['experience_level'].isna(), 'experience_level_MI'] = np.nan
df.loc[df['experience_level'].isna(), 'experience_level_SE'] = np.nan

However this is tedious.

Running the obvious:

df.loc[df['experience_level'].isna(), df_experience_level] = np.nan

Results in:

ValueError: Index data must be 1-dimensional

Is there any way to transfer the NaNs from the parent variable to each dummy variable in a single statement?


Solution

  • You can try with str.startswith to select the correct column subset:

    # Obviously replace exp_lvl with experience_level
    subset = df_exp_lvl.columns.str.startswith('exp_lvl')
    df_exp_lvl.loc[df['exp_lvl'].isna(), subset] = np.nan
    

    Output:

    >>> df_exp_lvl
        exp_lvl_EN  exp_lvl_EX  exp_lvl_MI  exp_lvl_SE  exp_lvl_nan
    0          1.0         0.0         0.0         0.0          0.0
    1          0.0         0.0         1.0         0.0          0.0
    2          1.0         0.0         0.0         0.0          0.0
    3          1.0         0.0         0.0         0.0          0.0
    4          0.0         0.0         0.0         1.0          0.0
    5          NaN         NaN         NaN         NaN          NaN
    6          0.0         1.0         0.0         0.0          0.0
    7          0.0         1.0         0.0         0.0          0.0
    8          NaN         NaN         NaN         NaN          NaN
    9          0.0         0.0         0.0         1.0          0.0
    10         0.0         0.0         1.0         0.0          0.0
    11         0.0         0.0         1.0         0.0          0.0
    12         1.0         0.0         0.0         0.0          0.0
    13         NaN         NaN         NaN         NaN          NaN
    14         0.0         1.0         0.0         0.0          0.0
    
    >>> df
       exp_lvl
    0       EN
    1       MI
    2       EN
    3       EN
    4       SE
    5      NaN
    6       EX
    7       EX
    8      NaN
    9       SE
    10      MI
    11      MI
    12      EN
    13     NaN
    14      EX