Search code examples
python-3.xpandaslambdapandas-groupby

Grouping and taking average length of characters


I have a dataset as under

Cty  A1  
AA  A123
AA  1111 
AA  99999
BB  a9999
BB  345689
BB  345699 
CC  1233 

I need to group the data by country, and identify mean length of characters of column A1 and A2 against each country.

So for Country AA, we see that there are 13 character for the 3 rows in A1 column, and while calculating average we would get 4.333 (i.e 13 / 3), same way for B we would get an average of 5.66

I used the below code to achieve the same

df = (df.groupby('Cty')['A1'].apply(lambda x: np.mean(x.str.len())).reset_index(name='mean_len_text'))

But got the below error

~\anaconda3\lib\site-packages\pandas\core\strings\accessor.py in __init__(self, data)
    152         from pandas.core.arrays.string_ import StringDtype
    153 
--> 154         self._inferred_dtype = self._validate(data)
    155         self._is_categorical = is_categorical_dtype(data.dtype)
    156         self._is_string = isinstance(data.dtype, StringDtype)

~\anaconda3\lib\site-packages\pandas\core\strings\accessor.py in _validate(data)
    215 
    216         if inferred_dtype not in allowed_types:
--> 217             raise AttributeError("Can only use .str accessor with string values!")
    218         return inferred_dtype
    219 

AttributeError: Can only use .str accessor with string values!

I tried to convert the column to string and run as below, but got an attribute error

df = (df.groupby('Cty')['A1'].astype(str).str.apply(lambda x: np.mean(x.str.len())).reset_index(name='mean_len_text'))

Error msg
AttributeError: 'SeriesGroupBy' object has no attribute 'astype'

Idea is to have the average length of characters for each country in a separate column.

Not sure how to go about the same

Help would be much appretiated.


Solution

  • You can use GroupBy.mean passing df['Cty'] as grouper:

    df['A1'].astype(str).str.len().groupby(df['Cty']).mean()
    

    NB. conversion to string can be skipped if you already have strings.

    Output:

    Cty
    AA    4.333333
    BB    5.666667
    CC    4.000000
    Name: A1, dtype: float64
    

    NB. if you want a DataFrame, add your .reset_index(name='mean_len_text')

    A fix of your approach could be:

    df.groupby('Cty')['A1'].apply(lambda x: x.str.len().mean())
    

    But this should be significantly less efficient.