pandas, machine-learning, series

Handling features with multiple values per instance in Python for a machine learning model


I am trying to prepare a data set in which some features contain multiple values per instance, as shown in the image:
https://i.sstatic.net/D78el.png
The values in each cell are separated by the '|' symbol. I want to split them so that I can apply one-hot encoding, but I can't find a suitable solution.
My idea is to keep all of an instance's values in one row, in other words to convert each cell to a list of integers.


Solution

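  • Maybe this is what you want. Series.str.get_dummies splits each string on a separator and returns one indicator column per distinct value: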

    import pandas as pd

    # Toy column where one cell holds several '|'-separated ids
    df = pd.DataFrame(['465', '444', '465', '864|857|850|843'], columns=['genre_ids'])
    df
    
             genre_ids
    0              465
    1              444
    2              465
    3  864|857|850|843
    
    df['genre_ids'].str.get_dummies(sep='|')
    
       444  465  843  850  857  864
    0    0    1    0    0    0    0
    1    1    0    0    0    0    0
    2    0    1    0    0    0    0
    3    0    0    1    1    1    1
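
If you also want each cell as a list of integers, as described in the question, a minimal sketch could look like the following. It reuses the toy data from above; the `genre_list` column name and the `genre_` prefix are illustrative assumptions, not something taken from your data set.

```python
import pandas as pd

# Same toy data as in the answer above.
df = pd.DataFrame(['465', '444', '465', '864|857|850|843'], columns=['genre_ids'])

# Split each cell on '|' and cast the pieces to int: one list of integers per row.
# 'genre_list' is a hypothetical column name chosen for this sketch.
df['genre_list'] = df['genre_ids'].str.split('|').apply(lambda ids: [int(i) for i in ids])

# One-hot encode the original strings and attach the indicator columns
# to the frame. The 'genre_' prefix is only there to keep the names readable.
dummies = df['genre_ids'].str.get_dummies(sep='|').add_prefix('genre_')
encoded = df.join(dummies)

print(encoded)
```

The list column is convenient for inspection, while the indicator columns in `encoded` are what most models expect as input. Note that the cast to int assumes every cell is a non-missing string of digits separated by '|'.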