Tags: python, python-3.x, pandas, dataframe, one-hot-encoding

One-hot encoding for list variable with customized delimiter and new column names


My data:

Rank    Platforms        Technology

high    Windows||Linux   Unity
high    Linux             
low     Windows          Unreal 
low     Linux||MacOs     GameMakerStudio||Unity||Unreal
low                      GameMakerStudio
low

I want to convert it to something like this:

Rank    platform_Windows  platform_linux  platform_MacOs technology_unity  technology_unreal technology_GameMakerStudio

high    1                 1                0             1                  0                   0
high    0                 1                0             0                  0                   0
low     1                 0                0             0                  1                   0 
low     0                 1                1             1                  1                   1 
low     0                 0                0             0                  0                   1
low     0                 0                0             0                  0                   0

So it's a kind of one-hot encoding. I have looked at several answers:

  1. How to one-hot-encode from a pandas column containing a list?
  2. Pandas get_dummies to create one hot with separator = ' ' and with character level separation [duplicate]
  3. How to one-hot-encode from a pandas column containing a list?

The issues are:

  • None of them shows how to split my list on the || delimiter.
  • None of them shows how to prefix the new column names, for example with platform_ and technology_. I need the prefix to know which original column each new column comes from.

My current code is:

df.drop('Platforms', 1).join(
    pd.get_dummies(
        pd.DataFrame(df.Platforms.str.split("||").tolist()).stack(),
        prefix=['platform']
    ).sum(level=0)
)

df.drop('Technology', 1).join(
    pd.get_dummies(
        pd.DataFrame(df.Technology.str.split("||").tolist()).stack(),
        prefix=['technology']
    ).sum(level=0)
)

But the error I get is:

TypeError: object of type 'float' has no len()

I have read the documentation for pandas.get_dummies and pandas.Series.str.get_dummies. The latter seems to accept a custom delimiter, while the former allows custom prefixes for the new columns...


Solution

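  • You can use Series.str.get_dummies, which splits each cell on a separator and gives an all-zero row for missing values, then prefix the new columns with add_prefix:

    # sep='||' states the delimiter explicitly; col.lower() yields the
    # prefixes 'platforms_' and 'technology_'
    s = [df[col].str.get_dummies(sep='||').add_prefix(f'{col.lower()}_')
            for col in ['Platforms', 'Technology']]

    pd.concat([df[['Rank']]] + s, axis=1)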