I am given an array/tensor of n rows, where each row consists of a feature vector (in the example, the first 4 values) and a positional vector (in the example, the 5th value). The whole array here therefore has shape (n, 5).
[ 1 2 3 4 *0* ]
[ 5 1 0 1 *1* ]
[ 0 1 0 1 *1* ]
[ 1 0 3 0 *2* ]
[ 1 1 2 6 *2* ]
[ 0 1 0 2 *2* ]
My goal is to pool (max, sum or avg) the rows along the first dimension according to their positional vector, i.e. all rows with the same positional vector (here the 5th value) should be combined by some symmetric function (let's say sum()) while keeping the 5th value unchanged. This results in a new array of shape (n', 5):
[ 1 2 3 4 *0* ]
[ 5 2 0 2 *1* ]
[ 2 2 5 8 *2* ]
Naturally, this could be achieved by looping over the array and accumulating the rows in a dict with key, value = positional_vector, sum(feature_vector, dict[positional_vector])
and then converting it back to an array.
Unfortunately, this method seems rather slow, and since I plan to use it while training a neural net, it seems more sensible to use some tensor/matrix multiplication magic.
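For reference, the loop-and-dict baseline described above could look roughly like the sketch below. The function name `pool_with_dict` is made up for illustration, and it assumes a single positional column holding hashable scalar values, as in the example.

```python
import numpy as np

def pool_with_dict(x):
    """Sum the feature columns of all rows that share the same last-column value."""
    sums = {}
    for row in x:
        key = int(row[-1])  # assumption: one integer positional column
        # Accumulate the feature part under its positional key.
        sums[key] = sums.get(key, 0) + row[:-1]
    # Rebuild an (n', 5) array: summed features followed by the positional value.
    return np.array([np.r_[feat, key] for key, feat in sums.items()])

x = np.array([[1, 2, 3, 4, 0],
              [5, 1, 0, 1, 1],
              [0, 1, 0, 1, 1],
              [1, 0, 3, 0, 2],
              [1, 1, 2, 6, 2],
              [0, 1, 0, 2, 2]])
pooled = pool_with_dict(x)
```

Because Python dicts keep insertion order, the groups come out in order of first appearance. An n-dimensional positional vector would need a hashable key such as a tuple.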
I highly appreciate any helpful comments :)
[Unlike in the given example, the positional vector may be n-dimensional and the rows are not ordered.]
So this is a crude answer based on the diff method I mentioned in the comments. Note that since you need an aggregation operation per group, there is no straightforward way to fully vectorize it efficiently. Also, this example assumes that your data is sorted by the last column; we'll get back to that later.
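import numpy as np

def reduce(x):
    # Sum the feature columns of one group and append the group's positional value.
    return np.r_[x[:, :-1].sum(axis=0), x[0, -1]]

x = np.array([[1, 2, 3, 4, 0],
              [5, 1, 0, 1, 1],
              [0, 1, 0, 1, 1],
              [1, 0, 3, 0, 2],
              [1, 1, 2, 6, 2],
              [0, 1, 0, 2, 2]])

# Row indices where the last column changes, i.e. where a new group starts.
ind = np.where(np.diff(x[:, -1], prepend=x[0, -1]))[0]
# Split into per-group blocks and reduce each block to a single row.
x_agg = np.array([reduce(i) for i in np.split(x, ind)])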
The code is simple: it finds the indices where the value in the last column changes, splits the array at those locations and aggregates each piece as you want.
Now, if the data is not sorted by the last column, two cases arise:
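To make the individual steps visible, here is a small sketch that keeps the intermediate values for the example array. The helper `pool` and its `op` parameter are names invented here to show how sum could be swapped for max or mean; they are not part of the code above.

```python
import numpy as np

x = np.array([[1, 2, 3, 4, 0],
              [5, 1, 0, 1, 1],
              [0, 1, 0, 1, 1],
              [1, 0, 3, 0, 2],
              [1, 1, 2, 6, 2],
              [0, 1, 0, 2, 2]])

pos = x[:, -1]                        # positional column
steps = np.diff(pos, prepend=pos[0])  # nonzero wherever a new group starts
ind = np.where(steps != 0)[0]         # start indices of all groups but the first
groups = np.split(x, ind)             # list of per-group row blocks
group_sizes = [len(g) for g in groups]

def pool(x, op=np.sum):
    """Hypothetical helper: apply `op` per group (rows must be sorted by last column)."""
    pos = x[:, -1]
    ind = np.where(np.diff(pos, prepend=pos[0]) != 0)[0]
    return np.array([np.r_[op(g[:, :-1], axis=0), g[0, -1]]
                     for g in np.split(x, ind)])

x_max = pool(x, np.max)
```

Passing `np.mean` works the same way, but the result then becomes a float array, positional column included.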
np.where(np.diff(...)!=0)
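One possible way to handle unsorted rows, sketched here as an assumption rather than as the method intended above, is to label each row with the index of its unique positional vector, sort by that label, and then reuse the same diff/split idea. The function name `pool_unsorted` and the parameter `n_pos` (number of trailing positional columns) are hypothetical.

```python
import numpy as np

def pool_unsorted(x, n_pos=1):
    """Sum features of rows sharing the same trailing `n_pos` positional columns."""
    feats, pos = x[:, :-n_pos], x[:, -n_pos:]
    # Label every row with the index of its unique positional vector.
    uniq, inverse = np.unique(pos, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # flatten; the returned shape has varied across NumPy versions
    # A stable sort brings rows of the same group next to each other.
    order = np.argsort(inverse, kind="stable")
    # Group boundaries in sorted order, as in the diff/split approach.
    ind = np.where(np.diff(inverse[order]) != 0)[0] + 1
    sums = np.array([g.sum(axis=0) for g in np.split(feats[order], ind)])
    return np.hstack([sums, uniq])

# Shuffled version of the example array.
x_shuffled = np.array([[1, 0, 3, 0, 2],
                       [5, 1, 0, 1, 1],
                       [1, 2, 3, 4, 0],
                       [0, 1, 0, 2, 2],
                       [0, 1, 0, 1, 1],
                       [1, 1, 2, 6, 2]])
pooled = pool_unsorted(x_shuffled)

# One feature column followed by a two-dimensional positional vector.
x_2d = np.array([[1, 0, 0],
                 [2, 1, 0],
                 [3, 0, 0]])
pooled_2d = pool_unsorted(x_2d, n_pos=2)
```

Note that the output groups are ordered by the sorted unique positional vectors, not by first appearance in the input.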
Hope this helps.