
SAS Proc Standard and the equivalent in Python


In SAS, PROC STANDARD lets users standardise data to a given mean and standard deviation within each group. Here, I want to standardise age to a mean of 0 and a standard deviation of 5 for each surname. How can I do this with a pandas DataFrame?

SAS code:

data mydata;
input surname $ name $ age ;
datalines;
Lim John 25
Lim David 100
Tan Mary 50
Tan Tom 30
;
run;

PROC STANDARD MEAN=0 STD=5 DATA=mydata OUT=mydata11;
VAR age;
BY surname;
run;

SAS Output

surname  name   age
Lim      John   -3.535533906
Lim      David   3.5355339059
Tan      Mary    3.5355339059
Tan      Tom    -3.535533906

Solution

  • Following the stats.stackexchange answer "Transform Data to Desired Mean and Standard Deviation", we can define a function that rescales a series to the target mean and standard deviation, then apply it to each surname group:

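    import pandas as pd

    df = pd.DataFrame({'surname': ['Lim', 'Lim', 'Tan', 'Tan'],
                       'name':    ['John', 'David', 'Mary', 'Tom'],
                       'age':     [25, 100, 50, 30]})

    def standard(x, mean, std):
        # Centre on the group mean, rescale by the group's sample std
        # (pandas uses ddof=1 by default), then shift to the target mean.
        x_std = mean + (x - x.mean()) * std / x.std()
        return x_std

    # transform() keeps the original row index, so the result aligns with df
    # even when the rows are not sorted by surname.
    df['age_standard'] = (df.groupby('surname')['age']
                            .transform(lambda x: standard(x, 0, 5))
                         )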
    
      surname   name  age   age_standard
    0     Lim   John   25      -3.535534
    1     Lim  David  100       3.535534
    2     Tan   Mary   50       3.535534
    3     Tan    Tom   30      -3.535534
    

    We can confirm this:

    df.groupby('surname')['age_standard'].mean() # Should be 0
    df.groupby('surname')['age_standard'].std()  # Should be 5
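
As a variation on the same idea, here is a minimal sketch that avoids the lambda by broadcasting the per-group mean and standard deviation with `transform`. The helper name `standardise_by_group` and its parameters are made up for illustration, and the rows are deliberately shuffled to show that the result does not depend on the data being sorted by surname. It assumes the sample standard deviation (`ddof=1`, the pandas default), which is what reproduces the SAS output above.

```python
import numpy as np
import pandas as pd

# Same four people as in the question, but deliberately not sorted by surname.
df = pd.DataFrame({
    "surname": ["Tan", "Lim", "Tan", "Lim"],
    "name":    ["Mary", "John", "Tom", "David"],
    "age":     [50, 25, 30, 100],
})

def standardise_by_group(frame, value_col, group_col, mean=0.0, std=1.0):
    """Hypothetical helper: rescale value_col within each group_col group."""
    grouped = frame.groupby(group_col)[value_col]
    group_mean = grouped.transform("mean")  # per-group mean, aligned to rows
    group_std = grouped.transform("std")    # per-group sample std (ddof=1)
    return mean + (frame[value_col] - group_mean) * std / group_std

df["age_standard"] = standardise_by_group(df, "age", "surname", mean=0, std=5)

# Per-group mean and std of the rescaled column, for checking.
check = df.groupby("surname")["age_standard"].agg(["mean", "std"])
print(df)
print(check)
```

One caveat: a surname that appears only once has an undefined sample standard deviation, so its standardised value comes out as NaN and may need separate handling.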