Pythonic way to aggregate arrays (NumPy or not)


I would like to write a clean function that aggregates data in an array (it is a NumPy record array, but that does not change anything).

You have an array of data that you want to aggregate along one axis: for example, an array with dtype=[('name', np.str_, 8), ('job', np.str_, 8), ('income', np.uint32)], and you want the mean income per job.

I wrote this function; in the example it would be called as aggregate(data, 'job', 'income', mean):


def aggregate(data, key, value, func):
    data_per_key = {}
    for k, v in zip(data[key], data[value]):
        if k not in data_per_key:
            data_per_key[k] = []
        data_per_key[k].append(v)
    return [(k, func(data_per_key[k])) for k in data_per_key]

The problem is that I do not find it very elegant, and I would like to have it in one line. Do you have any ideas?

Thanks for your answers,
Louis

PS: I would like to keep func as a parameter of the call so that you can also ask for the median, the minimum, and so on.


Solution

  • Perhaps the function you are seeking is matplotlib.mlab.rec_groupby:

    import numpy as np
    import matplotlib.mlab
    
    data=np.array(
        [('Aaron','Digger',1),
         ('Bill','Planter',2),
         ('Carl','Waterer',3),
         ('Darlene','Planter',3),
         ('Earl','Digger',7)],
        dtype=[('name', np.str_,8), ('job', np.str_,8), ('income', np.uint32)])
    
    result=matplotlib.mlab.rec_groupby(data, ('job',), (('income',np.mean,'avg_income'),))
    

    Iterating over result yields

    ('Digger', 4.0)
    ('Planter', 2.5)
    ('Waterer', 3.0)
    

    matplotlib.mlab.rec_groupby returns a recarray:

    print(result.dtype)
    # [('job', '|S7'), ('avg_income', '<f8')]
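
If matplotlib is not available, the aggregate(data, key, value, func) signature from the question can also be kept using only the standard library. The following is a minimal sketch rather than part of the original answer: it assumes that data[key] and data[value] are equal-length sequences and that the keys are sortable, and it uses a plain dict of lists as a hypothetical stand-in for the record array.

```python
from itertools import groupby
from operator import itemgetter
from statistics import mean


def aggregate(data, key, value, func):
    """Group data[value] by data[key] and apply func to each group."""
    # groupby only merges adjacent items, so sort the pairs by key first.
    pairs = sorted(zip(data[key], data[value]), key=itemgetter(0))
    return [(k, func([v for _, v in group]))
            for k, group in groupby(pairs, key=itemgetter(0))]


# Hypothetical stand-in for the record array used in the answer.
data = {
    'job': ['Digger', 'Planter', 'Waterer', 'Planter', 'Digger'],
    'income': [1, 2, 3, 3, 7],
}

print(aggregate(data, 'job', 'income', mean))
print(aggregate(data, 'job', 'income', min))
```

Because func is still passed in by the caller, any reduction that accepts a list (min, max, statistics.median, and so on) can be used. The result comes back sorted by key, which differs from the dictionary-based version in the question.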
    

    You may also be interested in checking out pandas, which has even more versatile facilities for handling group-by operations.
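
As a rough illustration of the pandas route, the sketch below wraps a group-by in a function with the same arguments as the question's aggregate. The name aggregate_pandas and the dict-of-lists sample data are assumptions made for this example; only DataFrame.groupby and agg come from pandas itself.

```python
import pandas as pd


def aggregate_pandas(data, key, value, func):
    """Return a Series of func applied to data[value], indexed by data[key]."""
    # Copy the two columns into a DataFrame; list() keeps this working for
    # plain lists as well as for fields of a NumPy record array.
    frame = pd.DataFrame({key: list(data[key]), value: list(data[value])})
    # func may be a reduction name such as 'mean' or a callable.
    return frame.groupby(key)[value].agg(func)


# Hypothetical stand-in for the record array used in the answer.
data = {
    'job': ['Digger', 'Planter', 'Waterer', 'Planter', 'Digger'],
    'income': [1, 2, 3, 3, 7],
}

avg_income = aggregate_pandas(data, 'job', 'income', 'mean')
print(avg_income)
```

Here the result is a pandas Series indexed by job rather than a list of tuples, so a single group can be looked up directly, for example avg_income['Planter'].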