Tags: pandas, group-by, normalize

Pandas: how to apply StandardScaler's fit_transform on grouped data


I have a DataFrame like this:

df

group  data  other
A      1     a
A      2     b
A      3     ad
A      4     aw
A      5     ad
B      100   ta
B      200   as
B      300   ab
B      400   ax
B      500   ad

I would like to groupby("group") and then apply StandardScaler().fit_transform() within each group.

(I tested on data containing only a single group with the code below, which works, but I run into problems when there is more than one group:

df['data'] = pd.DataFrame(scaler.fit_transform(df.groupby('group').data.values.reshape(-1,1)))

)

Is there a way to solve this with multiple groups, applying the scaler within each group?

Edit: my desired output would be:

group  data     other
A      -1.414   a
A      -0.7071  b
A      0        ad
A      0.7071   aw
A      1.414    ad
B      -1.414   ta
B      -0.7071  as
B      0        ab
B      0.7071   ax
B      1.414    ad

where data holds the standardized (normalized) values of the original data column.
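
For reference, these are z-scores computed with the population standard deviation, which is what StandardScaler uses. A quick check of the group A values with plain NumPy (a sketch, just to verify the expected numbers):

import numpy as np

a = np.array([1, 2, 3, 4, 5], dtype=float)
z = (a - a.mean()) / a.std(ddof=0)  # ddof=0: population std, as StandardScaler uses
print(z)  # approximately [-1.414, -0.707, 0, 0.707, 1.414]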


Solution

  • Use:

    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()

    def sc(row):
        # 'row' is actually one column of one group; StandardScaler expects a 2-D array
        return scaler.fit_transform(row.values.reshape(-1, 1))

    df.groupby('group').agg(sc)
    

    Please note that your other column is not numeric, so you cannot apply the standard scaler to it. I tried the above code with two numeric columns.
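
    If you only want to scale the numeric columns and leave the rest alone, one option (a sketch, assuming those columns already have a numeric dtype) is to restrict the groupby to them first:

    numeric_cols = df.select_dtypes('number').columns
    df.groupby('group')[list(numeric_cols)].agg(sc)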

    Demonstration for two numeric columns:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    d = '''group    data    other
    A   1   a
    A   2   b
    A   3   ad
    A   4   aw
    A   5   ad
    B   100 ta
    B   200 as
    B   300 ab
    B   400 ax
    B   500 ad'''
    # split on any whitespace so the exact column spacing does not matter
    data = [x.split() for x in d.split('\n') if x != '']
    df = pd.DataFrame(data[1:], columns=data[0])
    df['data'] = df['data'].astype(float)  # parsed values are strings
    df['other'] = df['data'] * 2           # second numeric column for the demo

    scaler = StandardScaler()

    def sc(row):
        # scale one column of one group
        return scaler.fit_transform(row.values.reshape(-1, 1))

    df.groupby('group').agg(sc)
    

    Output:

           data                                                 other
    group
    A      [[-1.414213562373095], [-0.7071067811865475], ...   [[-1.4142135623730951], [-0.7071067811865476], ...
    B      [[-1.414213562373095], [-0.7071067811865475], ...   [[-1.4142135623730951], [-0.7071067811865476], ...
    

    Based on the comment, to get the scaled values back as a flat numeric column of the original DataFrame:

    df['new'] = df.groupby('group')['data'].apply(sc).explode().values.astype(float)
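
    An alternative that returns the scaled values already aligned with the original index, avoiding the explode step, is groupby().transform (a sketch, assuming the data column is numeric or convertible to float):

    # 'scaled' is a hypothetical column name; scaler is the StandardScaler defined above
    df['scaled'] = df.groupby('group')['data'].transform(
        lambda s: scaler.fit_transform(s.to_numpy(dtype=float).reshape(-1, 1)).ravel()
    )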