Search code examples
pythonpandasdataframegroup-bypandas-explode

Using groupby() on an exploded pandas DataFrame returns a data frame where indeces are repeated but they have different attributes


I am working with a dataset found in kaggle (https://www.kaggle.com/datasets/shivamb/netflix-shows) which has data regarding different productions on Netflix. I am looking to answer the following question: How many productions are produced from each country.

Because there are productions which are essentially co-productions, meaning that under the column named 'country' there might be occasions where we have comma-separated strings in the form of : 'Country1','Country2',... etc, I have split the question into 2 parts:

The first part, is to filter the dataframe and keep only the rows which are single country productions and so we can find out how many productions are produced from each country provided that these are not co-productions. I encountered no problem during the first phase.

The second part is to take into account the co-productions. So, every time a country is found into a co-production the total number of it's productions is +=1.

The method I chose to solve the second part, is to convert into lists the comma-separated string values of countries and then to explode the 'country' column, then groupby('country') and get the .sum() of the attributes.

The problem that I have encountered is, that when I groupby('country'), there are countries which are repeated, for example United States is found with two different 'count' values as seen in the pics below.

(I need to underline that I have added to the original dataframe a column named 'count' which equals to the value 1 for every row and that df_no_nan_country = df[df['country'].notna()])

Why is that happening and what can I do to fix it?

This is my code:

import pandas as pd
df = pd.read_csv('netflix_titles.csv')
df['count'] = 1
df_no_nan_country = df[df['country'].notna()]
df_split_countries = df_no_nan_country.assign(country=df['country'].str.split(',')).explode('country')
df_split_countries.groupby('country').sum()
df_split_countries_max = df_split_countries.groupby('country').sum()[df_split_countries.groupby('country').sum()['count']>100]
df_split_countries_max.head(30)

enter image description here


Solution

  • try cleaning the column country before groupby

    df_split_countries['country'] = df_split_countries['country'].str.strip()
    df_split_countries['country'] = df_split_countries['country'].map(lambda x: .encode(x, errors="ignore").decode())
    

    I think countries are repeated because they are not exactly the same string values, maybe one contain extra space...