python pandas crosstab

Column totals are in the wrong order in pandas crosstab



The code below builds a pd.crosstab from the Titanic dataset that ships with Seaborn. The column sums in the margins row of the output table appear in the wrong order.

import pandas as pd
import seaborn as sns

titanic = sns.load_dataset('titanic')

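# Age bin edges: (0, 15] -> "kid", (15, 100] -> "adult".
# Named `bins` to avoid shadowing the built-in bin().
bins = [0, 15, 100]
titanic["adult"] = pd.cut(titanic.age, bins, labels=["kid", "adult"])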
pd.crosstab(titanic.survived, titanic.adult, normalize=True, margins=True)

I expected the last row, which holds the column sums, to read 0.116246 / 0.883754 / 1.000000, but it shows 0.883754 / 0.116246 / 1.000000 instead.


Solution

  • The totals are swapped because of NaN values in the original age column, which carry over into the binned adult column you created. Adding dropna=False to your pd.crosstab() call returns the expected result:

    pd.crosstab(titanic.survived, titanic.adult, dropna=False, normalize=True, margins=True)
    
    adult          kid     adult       All
    survived
    0         0.047619  0.546218  0.616162
    1         0.068627  0.337535  0.383838
    All       0.116246  0.883754  1.000000
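In the output above, the All column (0.616162, 0.383838) does not equal the sum of the two cells in its row (0.047619 + 0.546218 = 0.593837), which suggests those row totals may still count passengers whose age is missing. If you would rather not depend on how crosstab treats NaN, one possible alternative is to drop the rows with a missing age group first and build the margins by hand. The following is only a minimal sketch under stated assumptions: it uses a small made-up DataFrame instead of the Titanic data (so the numbers differ from those above), and the names df, known, counts and table are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic columns: 10 passengers, 2 without an age.
df = pd.DataFrame({
    "survived": [0, 0, 0, 1, 1, 0, 1, 0, 1, 0],
    "age": [5, 30, 40, 8, 25, 60, 35, np.nan, np.nan, 12],
})

bins = [0, 15, 100]
df["adult"] = pd.cut(df["age"], bins, labels=["kid", "adult"])

# A missing age stays missing after binning.
n_missing = int(df["adult"].isna().sum())

# Keep only rows with a known age group, and tabulate on plain strings
# so the result has ordinary (non-categorical) column labels.
known = df.dropna(subset=["adult"])
counts = pd.crosstab(known["survived"], known["adult"].astype(str))
counts = counts[["kid", "adult"]]  # fix the column order explicitly

# Normalize by the grand total, then append the margins manually.
table = counts / counts.to_numpy().sum()
table["All"] = table.sum(axis=1)
table.loc["All"] = table.sum(axis=0)

print(f"rows with missing age group: {n_missing}")
print(table)
```

Because the margins are computed from the cells themselves, each All entry is by construction the sum of its row or column, and the column order is whatever you select explicitly.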