python pandas dataframe data-science

Groupby Mean not working on titanic dataset in Python


I am using the Titanic dataset and trying to run the groupby command, but it's not working as shown in countless tutorials online. I have named my DataFrame ks_cl. Here is the command I executed in VS Code:

ks_cl.groupby(['sex']).mean()

This is the output:

NotImplementedError                       Traceback (most recent call last)
File d:\Program Files\Python\Lib\site-packages\pandas\core\groupby\groupby.py:1490, in GroupBy._cython_agg_general..array_func(values)
   1489 try:
-> 1490     result = self.grouper._cython_operation(
   1491         "aggregate",
   1492         values,
   1493         how,
   1494         axis=data.ndim - 1,
   1495         min_count=min_count,
   1496         **kwargs,
   1497     )
   1498 except NotImplementedError:
   1499     # generally if we have numeric_only=False
   1500     # and non-applicable functions
   1501     # try to python agg
   1502     # TODO: shouldn't min_count matter?

File d:\Program Files\Python\Lib\site-packages\pandas\core\groupby\ops.py:959, in BaseGrouper._cython_operation(self, kind, values, how, axis, min_count, **kwargs)
    958 ngroups = self.ngroups
--> 959 return cy_op.cython_operation(
    960     values=values,
    961     axis=axis,
    962     min_count=min_count,
    963     comp_ids=ids,
...
   1698             # e.g. "foo"
-> 1699             raise TypeError(f"Could not convert {x} to numeric") from err
   1700 return x

TypeError: Could not convert CSSSCSSSSSQSSSCSSCQSCSSSSSSSSSSSSCSCSSSSSSSSSQSSSCSSSCCSSQSCSCSSSSSSSCSSSSSSSQSCSSCCCSSSSCQSCSSCCSSSSCCSSCSSCCSSSSSQSSSSSSSSSSSSSCSCSCSSSCSQSSSCSSSCSSSSCCSSSSSCSSSSSSSCSCSCSSSSSSSSSCSCSSQQSSSCCSSCSSSSSSSSSSSQSSSCSSSSSSSSSSSSCCCCSSSSCSSCSCCCSSQS to numeric

I was expecting this output:

[image: screenshot of the expected output, not available]


Solution

  • You need to pass numeric_only=True to GroupBy.mean:

    numeric_only : (bool), default None
    Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

    Deprecated since version 1.5.0: Specifying numeric_only=None is deprecated. The default value will be False in a future version of pandas.

    Source: [docs]

    And per the pandas 2.0.0 release notes:

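    Changed default of numeric_only in various DataFrameGroupBy methods; all methods now default to numeric_only=False (GH46072)

    So on pandas 2.0 and later, mean() tries to average every column, including string columns such as the one whose values (C, S, Q) appear concatenated in your error message, and raises a TypeError. Passing numeric_only=True restricts the aggregation to numeric columns, which is the behavior the older tutorials rely on: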

    import pandas as pd

    link = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"

    ks_cl = pd.read_csv(link)

    # Average only the numeric columns within each group
    out = ks_cl.groupby("Sex").mean(numeric_only=True)
    

    Output:

    print(out)
    
            PassengerId  Survived    Pclass        Age     SibSp     Parch       Fare
    Sex
    female   431.028662  0.742038  2.159236  27.915709  0.694268  0.649682  44.479818
    male     454.147314  0.188908  2.389948  30.726645  0.429809  0.235702  25.523893
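
If you want to check the behavior without downloading anything, here is a minimal sketch on a small made-up frame. The frame `df` and its values are hypothetical stand-ins for the Titanic data, not rows from it. Note that the column is named `sex` in the question but `Sex` in the CSV used above, so adjust the name to match your own DataFrame. The sketch also shows a second option, selecting the columns explicitly before aggregating, which avoids depending on the `numeric_only` default at all.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data (values are made up).
df = pd.DataFrame(
    {
        "sex": ["male", "female", "female", "male"],
        "age": [22.0, 38.0, 26.0, 35.0],
        "fare": [7.25, 71.28, 7.92, 8.05],
        "embarked": ["S", "C", "S", "S"],
    }
)

# On pandas >= 2.0 the default numeric_only=False makes mean() attempt the
# string column "embarked" and raise TypeError. Older versions may instead
# drop that column, possibly with a warning, so the failure is not asserted.
try:
    df.groupby("sex").mean()
    default_failed = False
except TypeError as err:
    default_failed = True
    print("default mean() failed:", err)

# Option 1: restrict the aggregation to numeric columns.
means = df.groupby("sex").mean(numeric_only=True)

# Option 2: select the columns of interest explicitly, then aggregate.
means_selected = df.groupby("sex")[["age", "fare"]].mean()

print(means)
```

Both options should give the same per-group means here. Explicit selection is the more defensive choice if your code has to run across several pandas versions.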