Search code examples
pythonpandasdataframemulti-index

How to count the length of strings stored inside a pandas groupby?


I'm looking at wine review data from kaggle, and trying to practice using pandas. I'm running into a bit of an issue. What I'd like to do is group the wines into their country and variety, and then determine the average length of a description that wine reviewers have given a particular variety of wine from a particular country.

To do this, I've attempted

average_desc = reviews.groupby(['country','variety']).description.apply(lambda x: len(x)).mean()

but all this gives me is some mean value. I don't quite know what the issue is, or how I could provide more information, first post here so let me know how to clarify. I thought that once

reviews.groupby(['country','variety']).description

I'm essentially looking at the column which contains the wine descriptions sorted by variety, and just need to convert to the length of the strings inside the columns then take the mean. One thing I can say is that

average_desc = reviews.groupby(['country','variety']).description.apply(lambda x: len(x))

returns a count of how many of the different varieties of wines are to be found in the country below

country    variety                   
Argentina  Barbera                         1
           Bonarda                       105
           Bordeaux-style Red Blend       89
           Bordeaux-style White Blend      1
           Cabernet Blend                  8
                                        ... 
Uruguay    Tannat-Cabernet Franc           2
           Tannat-Merlot                   6
           Tannat-Syrah                    1
           Tempranillo-Tannat              1
           White Blend                     1
Name: description, Length: 1612, dtype: int64

So is there an issue of not indexing far enough into the dataframe? Thanks in advance for the help, and let me know if I need to write more to clarify the situation.


Solution

  • IIUC, I think you want something like this:

    reviews['description'].str.len().groupby([df['country'], df['variety']]).mean()
    

    Output:

    country    variety                   
    Argentina  Barbera                       236.000000
               Bonarda                       242.123810
               Bordeaux-style Red Blend      311.426966
               Bordeaux-style White Blend    274.000000
               Cabernet Blend                281.000000
                                                ...    
    Uruguay    Tannat-Cabernet Franc         300.500000
               Tannat-Merlot                 258.333333
               Tannat-Syrah                  275.000000
               Tempranillo-Tannat            260.000000
               White Blend                   248.000000
    Name: description, Length: 1612, dtype: float64