Search code examples
pythonpandasone-hot-encodingdata-processing

Perform one-hot encoding on pandas dataframe on multiple column types


So I have a pandas dataframe where certain columns have values of type list and a mix of columns of non-numeric and numeric data.

Example data

   dst_address   dst_enforcement   fwd_count ...
1  1.2.3.4       [Any,core]        8
2  3.4.5.6       []                9
3  6.7.8.9       [Any]             10
4  8.10.3.2      [core]            0

So far I've been able to find out which columns are non-numeric by these 2 lines of code

col_groups = df.columns.to_series().groupby(df.dtypes).groups
non_numeric_cols = col_groups[np.dtype('O')]

Of all these non-numeric columns, I need to figure out which ones have list as data type and I want to perform one-hot encoding on all non-numeric columns (including those list type)

EDIT: my expected output for above example would be something like

   1.2.3.4 | 3.4.5.6 | 6.7.8.9 | 8.10.3.2 | empty | Any | core | fwd_count ...
1  1         0         0         0          0       1     1      8
2  0         1         0         0          1       0     0      9
3  0         0         1         0          0       1     0      10
4  0         0         0         1          0       0     1      0    

Solution

  • I use 3 steps as follows:

    df['dst_enforcement'] = df.dst_enforcement.apply(lambda x: x if x else ['empty'])
    dm1 = pd.get_dummies(df[df.columns.difference(['dst_enforcement'])], prefix='', prefix_sep='')
    dm2 = df.dst_enforcement.str.join('-').str.get_dummies('-')
    pd.concat([dm1, dm2], axis=1)
    
    Out[1221]:
       fwd_count  1.2.3.4  3.4.5.6  6.7.8.9  8.10.3.2  Any  core  empty
    1          8        1        0        0         0    1     1      0
    2          9        0        1        0         0    0     0      1
    3         10        0        0        1         0    1     0      0
    4          0        0        0        0         1    0     1      0