Search code examples
pandasdataframegroup-by

After applying groupby, how to fill missing values with mode value of each group respectively?


This is the Data set I'm working with: Titanic Dataset

("https://docs.google.com/spreadsheets/d/e/2PACX-1vQjh5HzZ1N0SU7ME9ZQRzeVTaXaGsV97rU8R7eAcg53k27GTstJp9cRUOfr55go1GRRvTz1NwvyOnuh/pub?gid=1562145139&single=true&output=csv" this is the link to data set in case image is not enough)

This is the question: Q. Using groupby make groups using the "Pclass" column and fill every group's "Embarked" column's missing values with the mode value of that group. After that, print every group's "Embarked" column's value counts in ascending order.

I've tried few solutions from all over the internet, but none of them seems to work. And its necessarily to be done using groupby() only.

I tried following solutions: titanic_df = pd.read_csv(url)

  1. titanic_df.groupby('Pclass')['Embarked'].fillna(agg({'Embarked':'mode'}))
  2. titanic_df.groupby('Pclass')['Embarked'].apply(lambda x: x.fillna(x.mode()))

Solution

  • Looking at the dataframe, there are 2 NaN values in Embarked column:

    url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vQjh5HzZ1N0SU7ME9ZQRzeVTaXaGsV97rU8R7eAcg53k27GTstJp9cRUOfr55go1GRRvTz1NwvyOnuh/pub?gid=1562145139&single=true&output=csv'
    titanic_df = pd.read_csv(url)
    
    x = titanic_df[titanic_df['Embarked'].isna()]
    print(x)
    

    Prints:

         PassengerId  Survived  Pclass                                       Name     Sex   Age  SibSp  Parch  Ticket  Fare Cabin Embarked
    61            62         1       1                        Icard, Miss. Amelie  female  38.0      0      0  113572  80.0   B28      NaN
    829          830         1       1  Stone, Mrs. George Nelson (Martha Evelyn)  female  62.0      0      0  113572  80.0   B28      NaN
    

    Every NaN value is in Pclass 1. Looking further at .value_counts():

    x = titanic_df.groupby('Pclass')['Embarked'].value_counts()
    print(x)
    

    Prints:

    Pclass  Embarked
    1       S           127
            C            85
            Q             2
    2       S           164
            C            17
            Q             3
    3       S           353
            Q            72
            C            66
    Name: Embarked, dtype: int64
    

    So we want to replace these two NaNs with value S. We can find this value with .mode():

    titanic_df['Embarked'] = titanic_df.groupby('Pclass', group_keys=False)['Embarked'].transform(lambda x: x.fillna(x.mode()[0]))
    

    The value counts is now:

    Pclass  Embarked
    1       S           129
            C            85
            Q             2
    2       S           164
            C            17
            Q             3
    3       S           353
            Q            72
            C            66
    Name: Embarked, dtype: int64