This is the Data set I'm working with:
("https://docs.google.com/spreadsheets/d/e/2PACX-1vQjh5HzZ1N0SU7ME9ZQRzeVTaXaGsV97rU8R7eAcg53k27GTstJp9cRUOfr55go1GRRvTz1NwvyOnuh/pub?gid=1562145139&single=true&output=csv" this is the link to data set in case image is not enough)
This is the question: Q. Using groupby make groups using the "Pclass" column and fill every group's "Embarked" column's missing values with the mode value of that group. After that, print every group's "Embarked" column's value counts in ascending order.
I've tried few solutions from all over the internet, but none of them seems to work. And its necessarily to be done using groupby() only.
I tried following solutions: titanic_df = pd.read_csv(url)
Looking at the dataframe, there are 2 NaN
values in Embarked
column:
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vQjh5HzZ1N0SU7ME9ZQRzeVTaXaGsV97rU8R7eAcg53k27GTstJp9cRUOfr55go1GRRvTz1NwvyOnuh/pub?gid=1562145139&single=true&output=csv'
titanic_df = pd.read_csv(url)
x = titanic_df[titanic_df['Embarked'].isna()]
print(x)
Prints:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
61 62 1 1 Icard, Miss. Amelie female 38.0 0 0 113572 80.0 B28 NaN
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0 B28 NaN
Every NaN
value is in Pclass 1. Looking further at .value_counts()
:
x = titanic_df.groupby('Pclass')['Embarked'].value_counts()
print(x)
Prints:
Pclass Embarked
1 S 127
C 85
Q 2
2 S 164
C 17
Q 3
3 S 353
Q 72
C 66
Name: Embarked, dtype: int64
So we want to replace these two NaNs with value S
. We can find this value with .mode()
:
titanic_df['Embarked'] = titanic_df.groupby('Pclass', group_keys=False)['Embarked'].transform(lambda x: x.fillna(x.mode()[0]))
The value counts is now:
Pclass Embarked
1 S 129
C 85
Q 2
2 S 164
C 17
Q 3
3 S 353
Q 72
C 66
Name: Embarked, dtype: int64