I am having an issue using Series.unique() on the Titanic dataset.
Calling Series.unique() on the original DataFrame raises no error, but after concatenating the train and test sets on specific columns, calling Series.unique() raises an error.
From what I have tried, the cause is the replacement of null values in the fifth statement: if I comment out that line, the code runs without error. Why is that? And is there a workaround?
cat_cols = ['Pclass', 'Sex', 'Embarked']
df_train = pd.read_csv('train.csv')
df_pred = pd.read_csv('test.csv')
df_join = pd.concat([df_train[cat_cols], df_pred[cat_cols]])
df_join = df_join.fillna(df_join.mode, axis=0)
df_join.Embarked.unique()
The train and test files can be downloaded from:
https://www.kaggle.com/c/titanic/download/test.csv https://www.kaggle.com/c/titanic/download/train.csv
I am currently using Pandas Version 0.23.4
Given:
cat_cols = ['Pclass', 'Sex', 'Embarked']
df_train = pd.read_csv('train.csv')
df_pred = pd.read_csv('test.csv')
df_join = pd.concat([df_train[cat_cols], df_pred[cat_cols]])
NaN values occur only in the Embarked column, as can be verified with the code below:
df_join.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 3 columns):
Pclass 1309 non-null int64
Sex 1309 non-null object
Embarked 1307 non-null object
dtypes: int64(1), object(2)
memory usage: 80.9+ KB
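A more direct check is to count the missing values per column with isna().sum(). This is a minimal sketch that uses a small hypothetical frame in place of the Titanic files; the sample rows are made up for illustration, only the column names come from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_join: same three columns,
# with two missing values in Embarked (sample data, not the real dataset).
df_join = pd.DataFrame({
    'Pclass': [3, 1, 3, 3, 2],
    'Sex': ['male', 'female', 'female', 'male', 'male'],
    'Embarked': ['S', 'C', np.nan, 'S', np.nan],
})

# isna() marks missing cells; sum() counts them column by column.
na_counts = df_join.isna().sum()
print(na_counts)
```

On the real df_join this should report 2 for Embarked and 0 for the other two columns, matching the non-null counts from info() above.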
So, replace the NaN values with the mode of the Embarked column:
df_join.Embarked = df_join.Embarked.fillna(df_join.Embarked.mode()[0])
df_join.Embarked.value_counts().sum()
# 1309
and looking for unique values:
df_join.Embarked.unique()
# array(['S', 'C', 'Q'], dtype=object)
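Tip: use mode()[0], not mode. Without parentheses, mode is the method object itself rather than its result. mode() returns a Series (there can be ties), and [0] selects its first value.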
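The difference can be seen on a small example. This is only a sketch on a hypothetical frame (the sample rows are made up, not taken from the Titanic files). It also shows one possible way to fill every column at once, assuming you want each column's first mode: DataFrame.mode() returns a DataFrame, and its first row holds one mode per column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_join (sample data for illustration only).
df_join = pd.DataFrame({
    'Pclass': [3, 1, 3, 3, 2],
    'Sex': ['male', 'female', 'female', 'male', 'male'],
    'Embarked': ['S', 'C', np.nan, 'S', np.nan],
})

# Without parentheses, `mode` is just a bound method, not a value to fill with.
print(callable(df_join.Embarked.mode))  # True

# Calling it returns a Series of the most frequent value(s); [0] takes the first.
embarked_mode = df_join.Embarked.mode()[0]

# Whole-frame variant: take the first row of DataFrame.mode() (one mode per
# column) and pass that Series to fillna, which matches it to columns by name.
df_filled = df_join.fillna(df_join.mode().iloc[0])
print(df_filled.Embarked.unique())
```

After either fill, Embarked contains only strings, so unique() has nothing unusual to process.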
I hope this answers your question; if not, leave a comment.