python python-3.x pandas data-processing

"'DataFrame' objects are mutable, thus they cannot be hashed" error when using Series.unique()

I am having an issue while using Series.unique() on the Titanic dataframe.

Calling Series.unique() on the original df gives no error, but after concatenating the train and test sets on specific columns, calling Series.unique() raises the error in the title.

From what I have tried, the error is caused by replacing the null values in the 5th statement. If I comment out that line, the code runs without any error. Why is that? And is there a workaround?

cat_cols = ['Pclass', 'Sex', 'Embarked']
df_train = pd.read_csv('train.csv')
df_pred = pd.read_csv('test.csv')
df_join = pd.concat([df_train[cat_cols], df_pred[cat_cols]])
df_join = df_join.fillna(df_join.mode, axis=0)
df_join.Embarked.unique()

The train and test files can be downloaded from:

https://www.kaggle.com/c/titanic/download/test.csv
https://www.kaggle.com/c/titanic/download/train.csv

I am currently using Pandas Version 0.23.4

Solution

Given:

import pandas as pd

cat_cols = ['Pclass', 'Sex', 'Embarked']
df_train = pd.read_csv('train.csv')
df_pred = pd.read_csv('test.csv')
# Stack the categorical columns of the train and test sets
df_join = pd.concat([df_train[cat_cols], df_pred[cat_cols]])

NaN values occur only in the Embarked column, as the output of info() shows:

df_join.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 3 columns):
Pclass      1309 non-null int64
Sex         1309 non-null object
Embarked    1307 non-null object
dtypes: int64(1), object(2)
memory usage: 80.9+ KB
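
A more direct way to read off the missing counts per column is isna().sum(). The sketch below uses a small made-up frame (df_demo and its values are assumptions standing in for df_join, not the Titanic files) so that it runs without the CSVs:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_join; the real data comes from train.csv/test.csv.
df_demo = pd.DataFrame({
    'Pclass': [3, 1, 3, 2],
    'Sex': ['male', 'female', 'male', 'male'],
    'Embarked': ['S', 'C', np.nan, 'S'],
})

# isna() marks missing cells; summing the booleans counts them per column.
missing_counts = df_demo.isna().sum()
print(missing_counts)
```

On the real df_join, the same call should report 2 for Embarked and 0 for the other two columns, matching the info() output above.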

So, replace the NaN values with the mode of the Embarked column:

df_join.Embarked = df_join.Embarked.fillna(df_join.Embarked.mode()[0])
# value_counts() skips NaN, so a total equal to the row count means nothing is missing
df_join.Embarked.value_counts().sum()
# 1309

and looking for unique values:

df_join.Embarked.unique()
# array(['S', 'C', 'Q'], dtype=object)

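Tip: use mode()[0], not mode. Without parentheses, df_join.mode is the method object itself rather than its result, so fillna() receives a method bound to the DataFrame instead of a value. That is most likely what ends up in the NaN cells, and what unique() then fails to hash. Calling mode() returns the most frequent values, and [0] picks the first of them.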
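
To illustrate the difference, here is a minimal sketch on made-up data (df_demo and its values are assumptions, not the Titanic files). It shows that mode without parentheses is only a callable, while mode() returns a DataFrame whose first row holds the most frequent value of each column. Passing that row to fillna() is one way to fill every column with its own mode in a single call:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_join.
df_demo = pd.DataFrame({
    'Pclass': [3, 1, 3, 2],
    'Sex': ['male', 'female', 'male', 'male'],
    'Embarked': ['S', 'C', np.nan, 'S'],
})

# Without parentheses, mode is a bound method, not a value to fill with.
is_method = callable(df_demo.mode)

# mode() returns a DataFrame; row 0 holds the most frequent value per column.
modes = df_demo.mode().iloc[0]

# A Series keyed by column name fills each column with its own mode.
df_filled = df_demo.fillna(modes)
unique_embarked = df_filled['Embarked'].unique()
print(unique_embarked)
```

For a single column, df_join.Embarked.mode()[0] as used above is simpler. The iloc[0] form is only worth it when several columns have gaps.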

I hope this answers your question. If not, leave a comment.