I have a list of rows, with two key columns and 6 value
columns.
I want to calculate the sum, mean, and max
of the value columns, based on lists of columns for each function,
grouped by the key columns, using itertools.groupby
,
without using pandas.
columns = [ID, date, row1, row2, row3, row4, row5, row6]
rows = [[1, date1, 1, 1, 1, 0, 0, 1],
[1, date1, 1, 2, 0, 1, 0, 1],
[2, date1, 1, 3, 0, 0, 1, 1],
[2, date1, 1, 3, 0, 0, 2, 1]]
ColSum = [row2, row6]
ColMean = [row1, row3]
ColMax = [row4, row5]
The result I expect to calculate:
rows_result = [[1, date1, 1, 3, 0.5, 1, 0, 2],
[2, date1, 1, 6, 0, 0, 2, 2]]
My code:
for cols in ColSum:
index = int(np.where(columns == cols)[0][0])
for k, g in itertools.groupby(rows[:, index], operator.itemgetter(0,1)):
res.append((list(k) +list(map(sum, zip(*[c[2:] for c in g])))))
for colm in colMean:
index = int(np.where(columns == colm)[0][0])
for k, g in itertools.groupby(rows[:, index], operator.itemgetter(0,1)):
res.append((list(k) +list(map(sum, zip(*[c[2:] for c in g])))))
However, this doesn't work.
Calculating each of the grouped column results separately while constructing
the results list is pretty complicated. After initial construction for ColSum
,
you would have to edit each row-level element. It can be done, but I do not
think it fits with the spirit of functional programming.
Instead, the program below uses ColFun
to set the function to be applied to
each of the columns (equivalent to row1 through row6). In the for
loop,
it runs through each of the groups (like (1, 'date1')
), while constructing
a list of the 6 value columns with the requested function application on the
column.
The result is appended to a list, which matches the (slightly edited) desired result.
import itertools
import operator
import numpy as np
columns = ['ID', 'date', 'row1', 'row2', 'row3', 'row4', 'row5', 'row6']
rows = [[1, 'date1', 1, 1, 1, 0, 0, 1],
[1, 'date1', 1, 2, 0, 1, 0, 1],
[2, 'date1', 1, 3, 0, 0, 1, 1],
[2, 'date1', 1, 3, 0, 0, 2, 1]]
ColSum = ['row2', 'row6']
ColMean = ['row1', 'row3']
ColMax = ['row4', 'row5']
ColFun = [np.mean, np.sum, np.mean, np.max, np.max, np.sum]
rows_result = [[1, 'date1', 1.0, 3, 0.5, 1, 0, 2],
[2, 'date1', 1.0, 6, 0.0, 0, 2, 2]]
res = []
for k,g in itertools.groupby(rows, operator.itemgetter(0,1)):
res.append(list(k) + [sum(map(z[0], z[1:])) for z in
zip(ColFun, zip(*[c[2:] for c in g]))]
)
print(res == rows_result)
# True
print(res)
# [[1, 'date1', 1.0, 3, 0.5, 1, 0, 2],
# [2, 'date1', 1.0, 6, 0.0, 0, 2, 2]]