I'm looking at wine review data from kaggle, and trying to practice using pandas. I'm running into a bit of an issue. What I'd like to do is group the wines into their country and variety, and then determine the average length of a description that wine reviewers have given a particular variety of wine from a particular country.
To do this, I've attempted
average_desc = reviews.groupby(['country','variety']).description.apply(lambda x: len(x)).mean()
but all this gives me is some mean value. I don't quite know what the issue is, or how I could provide more information, first post here so let me know how to clarify. I thought that once
reviews.groupby(['country','variety']).description
I'm essentially looking at the column which contains the wine descriptions sorted by variety, and just need to convert to the length of the strings inside the columns then take the mean. One thing I can say is that
average_desc = reviews.groupby(['country','variety']).description.apply(lambda x: len(x))
returns a count of how many of the different varieties of wines are to be found in the country below
country variety
Argentina Barbera 1
Bonarda 105
Bordeaux-style Red Blend 89
Bordeaux-style White Blend 1
Cabernet Blend 8
...
Uruguay Tannat-Cabernet Franc 2
Tannat-Merlot 6
Tannat-Syrah 1
Tempranillo-Tannat 1
White Blend 1
Name: description, Length: 1612, dtype: int64
So is there an issue of not indexing far enough into the dataframe? Thanks in advance for the help, and let me know if I need to write more to clarify the situation.
IIUC, I think you want something like this:
reviews['description'].str.len().groupby([df['country'], df['variety']]).mean()
Output:
country variety
Argentina Barbera 236.000000
Bonarda 242.123810
Bordeaux-style Red Blend 311.426966
Bordeaux-style White Blend 274.000000
Cabernet Blend 281.000000
...
Uruguay Tannat-Cabernet Franc 300.500000
Tannat-Merlot 258.333333
Tannat-Syrah 275.000000
Tempranillo-Tannat 260.000000
White Blend 248.000000
Name: description, Length: 1612, dtype: float64