In SAS, PROC STANDARD allows users to standardise data to a given mean and standard deviation within each group.
Here, I want to standardise age to mean = 0 and standard deviation = 5 for each surname. How can I do this with a pandas DataFrame?
SAS code:
data mydata;
input surname $ name $ age ;
datalines;
Lim John 25
Lim David 100
Tan Mary 50
Tan Tom 30
;
run;
PROC STANDARD MEAN=0 STD=5 DATA=mydata OUT=mydata11;
VAR age;
BY surname;
run;
SAS output:
surname name age
Lim John -3.535533906
Lim David 3.5355339059
Tan Mary 3.5355339059
Tan Tom -3.535533906
Following this answer from stats.stackexchange (Transform Data to Desired Mean and Standard Deviation), we can define a function that rescales a series to the desired mean and standard deviation, then apply it to each surname group with a lambda:
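import pandas as pd

df = pd.DataFrame({'surname': ['Lim', 'Lim', 'Tan', 'Tan'],
                   'name': ['John', 'David', 'Mary', 'Tom'],
                   'age': [25, 100, 50, 30]})

def standard(x, mean, std):
    # Shift and rescale x to the target mean and standard deviation
    # (x.std() is the sample standard deviation, ddof=1).
    return mean + (x - x.mean()) * std / x.std()

# transform keeps the original row index, so the result aligns with df
df['age_standard'] = (df.groupby('surname')['age']
                        .transform(lambda x: standard(x, 0, 5)))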
surname name age age_standard
0 Lim John 25 -3.535534
1 Lim David 100 3.535534
2 Tan Mary 50 3.535534
3 Tan Tom 30 -3.535534
We can confirm this:
df.groupby('surname')['age_standard'].mean() # Should be 0
df.groupby('surname')['age_standard'].std() # Should be 5
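The same rescaling can also be written without a helper function or lambda. The following is only a sketch of one alternative, not the method from the linked answer: it broadcasts each group's mean and sample standard deviation back onto the rows and applies the formula column-wise. The names `target_mean`, `target_std`, `group_mean`, `group_std` and `check` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'surname': ['Lim', 'Lim', 'Tan', 'Tan'],
                   'name': ['John', 'David', 'Mary', 'Tom'],
                   'age': [25, 100, 50, 30]})

# Illustrative names; same targets as MEAN=0 STD=5 in the SAS call.
target_mean, target_std = 0, 5

# Per-row group statistics, aligned to df's original index.
grouped = df.groupby('surname')['age']
group_mean = grouped.transform('mean')
group_std = grouped.transform('std')  # sample std (ddof=1), the pandas default

df['age_standard'] = target_mean + (df['age'] - group_mean) * target_std / group_std

# Each group should now have the target mean and standard deviation.
check = df.groupby('surname')['age_standard'].agg(['mean', 'std'])
assert np.allclose(check['mean'], target_mean)
assert np.allclose(check['std'], target_std)
```

Note that a surname with only one row has an undefined sample standard deviation, so its `age_standard` comes out as NaN with this approach.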