python, pandas, dataframe, dataset, imdb

How to display the variable name in a Python DataFrame instead of the column name?


I'm currently studying the basics of data analysis with Python in Colab, and for that I'm using my IMDb watchlist as a dataset.

In the Genres column, several movie genres can be listed in the same cell (which makes things more difficult). I'm trying to calculate the proportion of each genre in this dataset and then plot the result, maybe with a pie or barh chart.

[Image: dataset]

So I created one variable per genre, each storing the value_counts() of a True/False mask for that genre, as you can see below:

action = df['Genres'].str.contains('Action').value_counts()
animation = df['Genres'].str.contains('Animation').value_counts()
biography = df['Genres'].str.contains('Biography').value_counts()
comedy = df['Genres'].str.contains('Comedy').value_counts()
crime = df['Genres'].str.contains('Crime').value_counts()
drama = df['Genres'].str.contains('Drama').value_counts()
documentary = df['Genres'].str.contains('Documentary').value_counts()
family = df['Genres'].str.contains('Family').value_counts()
fantasy = df['Genres'].str.contains('Fantasy').value_counts()
film_noir = df['Genres'].str.contains('Film-Noir').value_counts()
history = df['Genres'].str.contains('History').value_counts()
horror = df['Genres'].str.contains('Horror').value_counts()
mystery = df['Genres'].str.contains('Mystery').value_counts()
music = df['Genres'].str.contains('Music').value_counts()
musical = df['Genres'].str.contains('Musical').value_counts()
romance = df['Genres'].str.contains('Romance').value_counts()
scifi = df['Genres'].str.contains('Sci-Fi').value_counts()
sport = df['Genres'].str.contains('Sport').value_counts()
thriller = df['Genres'].str.contains('Thriller').value_counts()
war = df['Genres'].str.contains('War').value_counts()
western = df['Genres'].str.contains('Western').value_counts()

Then I put these variables into a DataFrame:

genres = pd.DataFrame(
    [action, animation, biography,
     comedy, crime, drama,
     documentary, family, fantasy,
     film_noir, history, horror,
     mystery, music, musical,
     romance, scifi, sport,
     thriller, war, western],
    )
genres.head(5)

The problem is in the output:

[Image: output]

I'd like it to display the variable names instead of 'Genres', which is what is currently shown in the first column. Is that possible?


Solution

  • To avoid a relatively slow for loop:

    Let's suppose we have the following DataFrame:

                           Genres
    0              Comedy, Horror
    1          Comedy, Drama, War
    2  Mistery, Romance, Thriller
    

    Proposed code

    import pandas as pd
    
    # create the original DataFrame
    df = pd.DataFrame({'Genres': ['Comedy, Horror', 'Comedy, Drama, War', 'Mistery, Romance, Thriller']})
    
    # split the genres on commas and strip surrounding whitespace
    df['Genres'] = df['Genres'].str.split(',').apply(lambda x: [i.strip() for i in x])
    
    # explode the list into separate rows
    df = df.explode('Genres')
    
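    # build a count matrix with crosstab: one row per original row, one column per genre
    genre_counts = pd.crosstab(index=df.index, columns=df['Genres'], margins=False).to_dict('index')
    
    # rebuilding the DataFrame from that dict transposes it, so the genres become the index
    genre_counts = pd.DataFrame(genre_counts)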
    
    # count the number of 0s and 1s in each row
    counts = ( genre_counts.apply(lambda row: [sum(row == 0), sum(row == 1)], axis=1) )
    
    # Final count with 2 columns 'False' and 'True'
    counts = pd.DataFrame(counts.tolist(), index=counts.index, columns=['False', 'True'])
    
    print(counts)
    

    Output

              False  True
    Comedy        1     2
    Drama         2     1
    Horror        2     1
    Mistery       2     1
    Romance       2     1
    Thriller      2     1
    War           2     1
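
If the goal is only to get a chosen label on each row instead of 'Genres', one possible approach is to build the rows from a dict, since the dict keys then become the index. The following is a minimal sketch, not the answerer's method: it reuses the toy data above, and the `genre_terms` mapping and the `rows` and `genres` names are hypothetical and only cover three genres for illustration.

```python
import pandas as pd

# Same toy data as in the answer above (illustrative, not the real watchlist).
df = pd.DataFrame({'Genres': ['Comedy, Horror',
                              'Comedy, Drama, War',
                              'Mistery, Romance, Thriller']})

# Hypothetical mapping: row label to display -> text to search for.
genre_terms = {'comedy': 'Comedy', 'drama': 'Drama', 'war': 'War'}

rows = {}
for label, term in genre_terms.items():
    # Boolean mask: True where the cell mentions the genre (plain substring match).
    mask = df['Genres'].str.contains(term, regex=False)
    rows[label] = {'False': int((~mask).sum()), 'True': int(mask.sum())}

# orient='index' turns the dict keys into row labels.
genres = pd.DataFrame.from_dict(rows, orient='index')
print(genres)
```

Keep in mind that a plain substring match also counts 'Musical' rows under 'Music'. Splitting the cells on commas first, as in the code above, avoids that overlap.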