Search code examples
pandas-groupbyoutliers

Apply z-score across all attributes by country


I'm trying to clean up a dataset that has data on every country in the world from 2000-2015. The population data by year is quite bad - I want to assign a z scores for each country's population data by year so I can see which data points to drop as outliers. How would I do this? I'm thinking I need to use groupby(), but I'm not sure how to deploy it.

I'm working with this WHO Kaggle dataset: https://www.kaggle.com/kumarajarshi/life-expectancy-who/data#

The data generally looks like this:

Example


Solution

  • Maybe, something like this might work -

    import numpy as np, pandas as pd
    l1 = ['a'] * 5 + ['b'] * 10 + ['c'] * 8
    l2 = list(np.random.randint(10,20,size=5)) + list(np.random.randint(100,150, size=10)) + list(np.random.randint(75,100, size=8))
    df = pd.DataFrame({'cat':l1, 'values':l2}) #creating a dummy dataframe
    df
        cat  values
    0    a      18
    1    a      17
    2    a      11
    3    a      13
    4    a      11
    5    b     102
    6    b     103
    7    b     119
    8    b     113
    9    b     100
    10   b     113
    11   b     102
    12   b     108
    13   b     128
    14   b     126
    15   c      75
    16   c      96
    17   c      81
    18   c      90
    19   c      80
    20   c      95
    21   c      96
    22   c      86
    
    df['z-score'] = df.groupby(['cat'])['values'].apply(lambda x: (x - x.mean())/x.std())
    df
    
         cat  values   z-score
    0    a      18     1.206045
    1    a      17     0.904534
    2    a      11    -0.904534
    3    a      13    -0.301511
    4    a      11    -0.904534
    5    b     102    -0.919587
    6    b     103    -0.821759
    7    b     119     0.743496
    8    b     113     0.156525
    9    b     100    -1.115244
    10   b     113     0.156525
    11   b     102    -0.919587
    12   b     108    -0.332617
    13   b     128     1.623951
    14   b     126     1.428295
    15   c      75    -1.520176
    16   c      96     1.059516
    17   c      81    -0.783121
    18   c      90     0.322461
    19   c      80    -0.905963
    20   c      95     0.936674
    21   c      96     1.059516
    22   c      86    -0.168908