Search code examples
pythonpandaspandas-groupby

groupby.mean() doesn't work while sum(), std() and size() all work


I am following the University of Michigan MOOC about Data Science in Python Pandas, and I encounter some problem in a test.

I have to use the groupby function to calculate the sum, mean, size and standard deviation of 15 countries, grouped by continent.

The problem is that sum(), std() and size() work with no problem, but not mean(), and I don't know why.

I already tried to specify the type by using dtype=float but I does not work.

This is my code :

# --------- This part is ok, just describing so you can understand --------- #
Top15 = answer_one() # load top 15 countries with most scientific publications

# list of the continents for the top 15 countries
ContinentDict  = {'China':'Asia', 
                  'United States':'North America', 
                  'Japan':'Asia', 
                  'United Kingdom':'Europe', 
                  'Russian Federation':'Europe', 
                  'Canada':'North America', 
                  'Germany':'Europe', 
                  'India':'Asia',
                  'France':'Europe', 
                  'South Korea':'Asia', 
                  'Italy':'Europe', 
                  'Spain':'Europe', 
                  'Iran':'Asia',
                  'Australia':'Australia', 
                  'Brazil':'South America'}

# estimation of the population for each countries 
# by calculating the Energy Supply / Energy Supply per Capita
Top15['PopEst'] = Top15['Energy Supply'] / Top15['Energy Supply per Capita']
Top15 = Top15[['PopEst']]

Top15.reset_index(inplace = True)
Top15['Continent'] = None

# loop that add the coresponding continent to the country
for country in Top15['Country']:
    index_country = ((Top15.loc[Top15['Country'] == country]) # seek country index
                           .index)
    Top15.iloc[index_country,2] = ContinentDict[country] # add continent to country


# ---------- This is the part where I am having problem ---------- #
# create the 'answer' DataFrame
answer = pd.DataFrame(index=['Asia', 'Australia', 
                             'Europe', 'North America', 
                             'South America'], 
                      columns=['size', 'sum', 'mean', 'std'], dtype=float)

grouped = Top15.groupby('Continent')      # group countries by continent

answer['size'] = grouped.size()
answer['sum'] = grouped['PopEst'].sum()
answer['mean'] = grouped['PopEst'].mean()
answer['std'] = grouped['PopEst'].std()

I got at the line answer['mean'] = grouped['PopEst'].mean() , the error:

DataError: No numeric types to aggregate

I don't know where the problem is.

PopEst contains numeric values. For example, the population estimate of China is 1.36765e+09 people.

This is the DataFrame Top15 returned by answer_one() I have to work on:

    Country             PopEst      Continent  
0   Australia           2.3316e+07  Australia
1   Brazil              2.05915e+08 South America
2   Canada              3.52399e+07 North America
3   China               1.36765e+09 Asia
4   France              6.38373e+07 Europe
5   Germany             8.03697e+07 Europe
6   India               1.27673e+09 Asia
7   Iran                7.70756e+07 Asia
8   Italy               5.99083e+07 Europe
9   Japan               1.27409e+08 Asia
10  Russian Federation  1.435e+08   Europe
11  South Korea         4.98054e+07 Asia
12  Spain               4.64434e+07 Europe
13  United Kingdom      6.3871e+07  Europe
14  United States       3.17615e+08 North America

This is what Top15.to_dict() returns me:

{'Country': {0: 'Australia',
  1: 'Brazil',
  2: 'Canada',
  3: 'China',
  4: 'France',
  5: 'Germany',
  6: 'India',
  7: 'Iran',
  8: 'Italy',
  9: 'Japan',
  10: 'Russian Federation',
  11: 'South Korea',
  12: 'Spain',
  13: 'United Kingdom',
  14: 'United States'},
 'PopEst': {0: 23316017.316017315,
  1: 205915254.23728815,
  2: 35239864.86486486,
  3: 1367645161.2903225,
  4: 63837349.39759036,
  5: 80369696.96969697,
  6: 1276730769.2307692,
  7: 77075630.25210084,
  8: 59908256.880733944,
  9: 127409395.97315437,
  10: 143500000.0,
  11: 49805429.864253394,
  12: 46443396.2264151,
  13: 63870967.741935484,
  14: 317615384.61538464},
 'Continent': {0: 'Australia',
  1: 'South America',
  2: 'North America',
  3: 'Asia',
  4: 'Europe',
  5: 'Europe',
  6: 'Asia',
  7: 'Asia',
  8: 'Europe',
  9: 'Asia',
  10: 'Europe',
  11: 'Asia',
  12: 'Europe',
  13: 'Europe',
  14: 'North America'}}

Solution

  • This is a bug of Pandas, Pandas still do sum and prod calculation in groupby even if data is not numerical. I checked the source code, the bug appears in line 1373 of site-packages\pandas\core\groupby\groupby.py. It writes:

                    except Exception:
                        pass
    

    if you print the error, you will probably find 'No numeric types to aggregate' too.

    As a walk-around solution, you could change the data to numerical before doing any calculations by using:

    df['column'] = pd.to_numeric(df['column'])
    

    some post may tell you to add errors='coerce' inside pd.to_numeric so that non-numerical element will be replace by na and won't raise an error. However, in many situations, it means some errors in the data. We need to fix the data rather than silencing the error.