I am following the University of Michigan MOOC about Data Science in Python Pandas, and I encounter some problem in a test.
I have to use the groupby function to calculate the sum, mean, size and standard deviation of 15 countries, grouped by continent.
The problem is that sum(), std() and size() work with no problem, but not mean(), and I don't know why.
I already tried to specify the type by using dtype=float
but I does not work.
This is my code :
# --------- This part is ok, just describing so you can understand --------- #
Top15 = answer_one() # load top 15 countries with most scientific publications
# list of the continents for the top 15 countries
ContinentDict = {'China':'Asia',
'United States':'North America',
'Japan':'Asia',
'United Kingdom':'Europe',
'Russian Federation':'Europe',
'Canada':'North America',
'Germany':'Europe',
'India':'Asia',
'France':'Europe',
'South Korea':'Asia',
'Italy':'Europe',
'Spain':'Europe',
'Iran':'Asia',
'Australia':'Australia',
'Brazil':'South America'}
# estimation of the population for each countries
# by calculating the Energy Supply / Energy Supply per Capita
Top15['PopEst'] = Top15['Energy Supply'] / Top15['Energy Supply per Capita']
Top15 = Top15[['PopEst']]
Top15.reset_index(inplace = True)
Top15['Continent'] = None
# loop that add the coresponding continent to the country
for country in Top15['Country']:
index_country = ((Top15.loc[Top15['Country'] == country]) # seek country index
.index)
Top15.iloc[index_country,2] = ContinentDict[country] # add continent to country
# ---------- This is the part where I am having problem ---------- #
# create the 'answer' DataFrame
answer = pd.DataFrame(index=['Asia', 'Australia',
'Europe', 'North America',
'South America'],
columns=['size', 'sum', 'mean', 'std'], dtype=float)
grouped = Top15.groupby('Continent') # group countries by continent
answer['size'] = grouped.size()
answer['sum'] = grouped['PopEst'].sum()
answer['mean'] = grouped['PopEst'].mean()
answer['std'] = grouped['PopEst'].std()
I got at the line answer['mean'] = grouped['PopEst'].mean()
, the error:
DataError: No numeric types to aggregate
I don't know where the problem is.
PopEst contains numeric values. For example, the population estimate of China is 1.36765e+09 people.
This is the DataFrame Top15 returned by answer_one()
I have to work on:
Country PopEst Continent
0 Australia 2.3316e+07 Australia
1 Brazil 2.05915e+08 South America
2 Canada 3.52399e+07 North America
3 China 1.36765e+09 Asia
4 France 6.38373e+07 Europe
5 Germany 8.03697e+07 Europe
6 India 1.27673e+09 Asia
7 Iran 7.70756e+07 Asia
8 Italy 5.99083e+07 Europe
9 Japan 1.27409e+08 Asia
10 Russian Federation 1.435e+08 Europe
11 South Korea 4.98054e+07 Asia
12 Spain 4.64434e+07 Europe
13 United Kingdom 6.3871e+07 Europe
14 United States 3.17615e+08 North America
This is what Top15.to_dict()
returns me:
{'Country': {0: 'Australia',
1: 'Brazil',
2: 'Canada',
3: 'China',
4: 'France',
5: 'Germany',
6: 'India',
7: 'Iran',
8: 'Italy',
9: 'Japan',
10: 'Russian Federation',
11: 'South Korea',
12: 'Spain',
13: 'United Kingdom',
14: 'United States'},
'PopEst': {0: 23316017.316017315,
1: 205915254.23728815,
2: 35239864.86486486,
3: 1367645161.2903225,
4: 63837349.39759036,
5: 80369696.96969697,
6: 1276730769.2307692,
7: 77075630.25210084,
8: 59908256.880733944,
9: 127409395.97315437,
10: 143500000.0,
11: 49805429.864253394,
12: 46443396.2264151,
13: 63870967.741935484,
14: 317615384.61538464},
'Continent': {0: 'Australia',
1: 'South America',
2: 'North America',
3: 'Asia',
4: 'Europe',
5: 'Europe',
6: 'Asia',
7: 'Asia',
8: 'Europe',
9: 'Asia',
10: 'Europe',
11: 'Asia',
12: 'Europe',
13: 'Europe',
14: 'North America'}}
This is a bug of Pandas
, Pandas still do sum and prod calculation in groupby even if data is not numerical. I checked the source code, the bug appears in line 1373 of site-packages\pandas\core\groupby\groupby.py
. It writes:
except Exception:
pass
if you print the error, you will probably find 'No numeric types to aggregate' too.
As a walk-around solution, you could change the data to numerical before doing any calculations by using:
df['column'] = pd.to_numeric(df['column'])
some post may tell you to add errors='coerce'
inside pd.to_numeric
so that non-numerical element will be replace by na
and won't raise an error. However, in many situations, it means some errors in the data. We need to fix the data rather than silencing the error.