So I have a pandas dataframe where certain columns have values of type list and a mix of columns of non-numeric and numeric data.
Example data
dst_address dst_enforcement fwd_count ...
1 1.2.3.4 [Any,core] 8
2 3.4.5.6 [] 9
3 6.7.8.9 [Any] 10
4 8.10.3.2 [core] 0
So far I've been able to find out which columns are non-numeric by these 2 lines of code
col_groups = df.columns.to_series().groupby(df.dtypes).groups
non_numeric_cols = col_groups[np.dtype('O')]
Of all these non-numeric columns, I need to figure out which ones have list as data type and I want to perform one-hot encoding on all non-numeric columns (including those list type)
EDIT: my expected output for above example would be something like
1.2.3.4 | 3.4.5.6 | 6.7.8.9 | 8.10.3.2 | empty | Any | core | fwd_count ...
1 1 0 0 0 0 1 1 8
2 0 1 0 0 1 0 0 9
3 0 0 1 0 0 1 0 10
4 0 0 0 1 0 0 1 0
I use 3 steps as follows:
df['dst_enforcement'] = df.dst_enforcement.apply(lambda x: x if x else ['empty'])
dm1 = pd.get_dummies(df[df.columns.difference(['dst_enforcement'])], prefix='', prefix_sep='')
dm2 = df.dst_enforcement.str.join('-').str.get_dummies('-')
pd.concat([dm1, dm2], axis=1)
Out[1221]:
fwd_count 1.2.3.4 3.4.5.6 6.7.8.9 8.10.3.2 Any core empty
1 8 1 0 0 0 1 1 0
2 9 0 1 0 0 0 0 1
3 10 0 0 1 0 1 0 0
4 0 0 0 0 1 0 1 0