Tags: python, numpy, pandas, normalize

NumPy: normalize column B according to the value of column A


Given a NumPy array with columns [A B], where A holds group indexes and B holds count values, how can I normalize the B values within each group of A?

I tried:

import numpy as np

def normalize(np_array):
    normalized_array = np.empty([1, 2])
    indexes= np.unique(np_array[:, 0]).tolist()

    for index in indexes:
        index_array= np_array[np_array[:, 0] == index]
        mean_id = np.mean(index_array[:, 1])
        std_id = np.std(index_array[:, 1])
        if mean_id * std_id > 0:
            index_array[:, 1] = (index_array[:, 1] - mean_id) / std_id
            normalized_array = np.concatenate([normalized_array, index_array])
    return np.delete(normalized_array, 0, 0)  # drop the uninitialized placeholder row from np.empty

This does the job, but I'm looking for a more elegant way to achieve it.

Any input would be warmly welcome.


Solution

  • Looks like pandas can be of help here:

    import pandas as pd
    
    df = pd.DataFrame({'ID': [1, 1, 2, 2, 1],
                       'value': [10, 20, 15, 100, 12]})
    
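    # Broadcast each group's statistics back onto its rows.
    # Note: pandas' std is the sample standard deviation (ddof=1).
    byid = df.groupby('ID')['value']
    mean = byid.transform('mean')
    std = byid.transform('std')
    
    df['normalized'] = (df['value'] - mean) / std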
    print(df)
    

    Output:

       ID  value  normalized
    0   1     10   -0.755929
    1   1     20    1.133893
    2   2     15   -0.707107
    3   2    100    0.707107
    4   1     12   -0.377964
    

    Coming from a NumPy array:

    >>> a
    array([[  1,  10],
           [  1,  20],
           [  2,  15],
           [  2, 100],
           [  1,  12]])
    

    You can create your dataframe like this:

    >>> df = pd.DataFrame({'ID': a[:, 0], 'value': a[:, 1]})
    >>> df
       ID  value
    0   1     10
    1   1     20
    2   2     15
    3   2    100
    4   1     12
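
If you would rather stay in NumPy, the same per-group normalization can be sketched without a Python loop. This is only one possible approach, not part of the answer above: the function name `normalize_by_group` and its `ddof` parameter are made up for illustration. It assumes column 0 holds the IDs and column 1 the values. Unlike the loop in the question, it keeps the original row order and yields NaN (rather than dropping rows) for groups whose standard deviation is zero or undefined.

```python
import numpy as np

def normalize_by_group(arr, ddof=0):
    """Z-score column 1 of `arr` within each group given by column 0.

    Hypothetical helper for illustration. ddof=0 mirrors np.std's default,
    ddof=1 mirrors the default of pandas' std().
    """
    ids = arr[:, 0]
    values = arr[:, 1].astype(float)

    # inverse[i] is the index of ids[i] among the sorted unique IDs
    _, inverse = np.unique(ids, return_inverse=True)

    counts = np.bincount(inverse)
    means = np.bincount(inverse, weights=values) / counts
    centered = values - means[inverse]

    # Groups with zero or undefined spread produce NaN instead of a warning
    with np.errstate(divide='ignore', invalid='ignore'):
        var = np.bincount(inverse, weights=centered ** 2) / (counts - ddof)
        z = centered / np.sqrt(var)[inverse]

    # Rows stay in their original order; the result is a float array
    return np.column_stack([ids, z])

a = np.array([[1, 10], [1, 20], [2, 15], [2, 100], [1, 12]])
result = normalize_by_group(a, ddof=1)
print(result)
```

With `ddof=1` the second column should agree with the `normalized` column in the pandas output above. With the default `ddof=0` it follows `np.std`, which is what the loop in the question uses, so the numbers differ slightly from the pandas result.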