Search code examples
pythonlistnumpygroup-bypython-itertools

Apply multiple aggregate functions on a list without using pandas


I have a list of rows, with two key columns and 6 value columns. I want to calculate the sum, mean, and max of the value columns, based on lists of columns for each function, grouped by the key columns, using itertools.groupby, without using pandas.

columns = [ID, date, row1, row2, row3, row4, row5, row6]
rows = [[1, date1, 1,     1,    1,   0,    0,  1],
        [1, date1, 1,     2,    0,   1,    0,  1],
        [2, date1, 1,     3,    0,   0,    1,  1], 
        [2, date1, 1,     3,    0,   0,    2,  1]]

ColSum = [row2, row6]
ColMean = [row1, row3]
ColMax = [row4, row5]

The result I expect to calculate:

rows_result = [[1, date1, 1,     3,  0.5,   1,    0,  2],
               [2, date1, 1,     6,    0,   0,    2,  2]] 

My code:

for cols in ColSum:
  index = int(np.where(columns == cols)[0][0])
        
  for k, g in itertools.groupby(rows[:, index], operator.itemgetter(0,1)):
    res.append((list(k) +list(map(sum, zip(*[c[2:] for c in g])))))

for colm in colMean:
  index = int(np.where(columns == colm)[0][0])
    
  for k, g in itertools.groupby(rows[:, index], operator.itemgetter(0,1)):
    res.append((list(k) +list(map(sum, zip(*[c[2:] for c in g])))))

However, this doesn't work.


Solution

  • Calculating each of the grouped column results separately while constructing the results list is pretty complicated. After initial construction for ColSum, you would have to edit each row-level element. It can be done, but I do not think it fits with the spirit of functional programming.

    Instead, the program below uses ColFun to set the function to be applied to each of the columns (equivalent to row1 through row6). In the for loop, it runs through each of the groups (like (1, 'date1')), while constructing a list of the 6 value columns with the requested function application on the column.

    The result is appended to a list, which matches the (slightly edited) desired result.

    import itertools
    import operator
    import numpy as np
    
    columns = ['ID', 'date', 'row1', 'row2', 'row3', 'row4', 'row5', 'row6']
    rows = [[1, 'date1', 1,     1,    1,   0,    0,  1],
            [1, 'date1', 1,     2,    0,   1,    0,  1],
            [2, 'date1', 1,     3,    0,   0,    1,  1], 
            [2, 'date1', 1,     3,    0,   0,    2,  1]]
    
    ColSum = ['row2', 'row6']
    ColMean = ['row1', 'row3']
    ColMax = ['row4', 'row5']
    ColFun = [np.mean, np.sum, np.mean, np.max, np.max, np.sum]
    rows_result = [[1, 'date1', 1.0,   3,  0.5,   1,    0,  2],
                   [2, 'date1', 1.0,   6,  0.0,   0,    2,  2]]
    
    res = []
    for k,g in itertools.groupby(rows, operator.itemgetter(0,1)):
        res.append(list(k) + [sum(map(z[0], z[1:])) for z in
            zip(ColFun, zip(*[c[2:] for c in g]))]
            )
    print(res == rows_result)
    # True
    print(res)
    # [[1, 'date1', 1.0, 3, 0.5, 1, 0, 2],
    #  [2, 'date1', 1.0, 6, 0.0, 0, 2, 2]]