Search code examples
pythonpandasdataframesplitdummy-variable

Separating categories within one column in my dataframe


I need to research something about what are the most cost efficient movie genres. My problem is that the genres are provided all within one string:

list of movie genres

This gives me about 300 different unique categories. How can I split these into about 12 original dummy genre columns so I can analyse each main genre?


Solution

  • Thanks to Yong Wang who suggested the get_dummies function within pandas. We can shorten the code significantly:

    df = pd.DataFrame({
        'movie_id': range(5),
        'gernes': [
                    'Action|Adventure|Fantasy|Sci-Fi',
                    'Action|Adventure|Fantasy',
                    'Action|Adventure|Thriller',
                    'Action|Thriller',
                    'Action|Adventure|Sci-Fi'
                  ]
    })  
    dummies = df['gernes'].str.get_dummies(sep='|')
    final = pd.concat([df, dummies], axis=1)
    

    Result:

       movie_id                           gernes  Action  Adventure  Fantasy  Sci-Fi  Thriller
    0         0  Action|Adventure|Fantasy|Sci-Fi       1          1        1       1         0
    1         1         Action|Adventure|Fantasy       1          1        1       0         0
    2         2        Action|Adventure|Thriller       1          1        0       0         1
    3         3                  Action|Thriller       1          0        0       0         1
    4         4          Action|Adventure|Sci-Fi       1          1        0       1         0
    

    Original answer

    One solution combining pandas and data preparation techniques from Machine Learning. Assuming you are on pandas v0.25 or later.

    First, let's create a dataframe from your screenshot:

    df = pd.DataFrame({
        'movie_id': range(5),
        'gernes': [
                    'Action|Adventure|Fantasy|Sci-Fi',
                    'Action|Adventure|Fantasy',
                    'Action|Adventure|Thriller',
                    'Action|Thriller',
                    'Action|Adventure|Sci-Fi'
                  ]
    })
    
       movie_id                           gernes
    0         0  Action|Adventure|Fantasy|Sci-Fi
    1         1         Action|Adventure|Fantasy
    2         2        Action|Adventure|Thriller
    3         3                  Action|Thriller
    4         4          Action|Adventure|Sci-Fi
    

    A movie can belong to multiple gernes. What we want is to seperate those gernes through a process called one-hot encoding. We define the categories (Action, Adventure, Thriller, etc.) and mark each movie as belonging to each category or not:

    from sklearn.preprocessing import OneHotEncoder
    
    s = df['gernes'].str.split('|').explode()
    encoder = OneHotEncoder()
    encoded = encoder.fit_transform(s.values[:, None])
    one_hot_df = pd.DataFrame(encoded.toarray(), columns=np.ravel(encoder.categories_), dtype='int') \
                    .groupby(s.index) \
                    .sum()
    
       Action  Adventure  Fantasy  Sci-Fi  Thriller
    0       1          1        1       1         0
    1       1          1        1       0         0
    2       1          1        0       0         1
    3       1          0        0       0         1
    4       1          1        0       1         0
    

    What it means is that the first movie belongs to the Action, Adventure, Fantasy and Sci-Fi but not Thriller categories, the second movie belongs to Action, Adventure and Fantasy and so on. The final stop is to combine them together:

    final = pd.concat([df, one_hot_df], axis=1)
    
       movie_id                           gernes  Action  Adventure  Fantasy  Sci-Fi  Thriller
    0         0  Action|Adventure|Fantasy|Sci-Fi       1          1        1       1         0
    1         1         Action|Adventure|Fantasy       1          1        1       0         0
    2         2        Action|Adventure|Thriller       1          1        0       0         1
    3         3                  Action|Thriller       1          0        0       0         1
    4         4          Action|Adventure|Sci-Fi       1          1        0       1         0