I'm trying to clean up a dataset that has data on every country in the world from 2000-2015. The population data by year is quite bad - I want to assign a z scores for each country's population data by year so I can see which data points to drop as outliers. How would I do this? I'm thinking I need to use groupby(), but I'm not sure how to deploy it.
I'm working with this WHO Kaggle dataset: https://www.kaggle.com/kumarajarshi/life-expectancy-who/data#
The data generally looks like this:
Maybe, something like this might work -
import numpy as np, pandas as pd
l1 = ['a'] * 5 + ['b'] * 10 + ['c'] * 8
l2 = list(np.random.randint(10,20,size=5)) + list(np.random.randint(100,150, size=10)) + list(np.random.randint(75,100, size=8))
df = pd.DataFrame({'cat':l1, 'values':l2}) #creating a dummy dataframe
df
cat values
0 a 18
1 a 17
2 a 11
3 a 13
4 a 11
5 b 102
6 b 103
7 b 119
8 b 113
9 b 100
10 b 113
11 b 102
12 b 108
13 b 128
14 b 126
15 c 75
16 c 96
17 c 81
18 c 90
19 c 80
20 c 95
21 c 96
22 c 86
df['z-score'] = df.groupby(['cat'])['values'].apply(lambda x: (x - x.mean())/x.std())
df
cat values z-score
0 a 18 1.206045
1 a 17 0.904534
2 a 11 -0.904534
3 a 13 -0.301511
4 a 11 -0.904534
5 b 102 -0.919587
6 b 103 -0.821759
7 b 119 0.743496
8 b 113 0.156525
9 b 100 -1.115244
10 b 113 0.156525
11 b 102 -0.919587
12 b 108 -0.332617
13 b 128 1.623951
14 b 126 1.428295
15 c 75 -1.520176
16 c 96 1.059516
17 c 81 -0.783121
18 c 90 0.322461
19 c 80 -0.905963
20 c 95 0.936674
21 c 96 1.059516
22 c 86 -0.168908