Add numpy array elements/slices with same bin assignment

I have a 2-D array A and a 1-D array bins, where bins[i] holds the bin assignment of row i of A. I want to construct an array S whose row b is the sum of all rows of A assigned to bin b, for example

S[0, :] = (A[(bins == 0), :]).sum(axis=0)

This is easy enough to do with np.stack and a list comprehension, but that seems overly complicated and not very readable. Is there a more general way to sum (or apply some other function to) slices of an array grouped by bin assignment? scipy.stats.binned_statistic is along the right lines, but it requires the bin assignments and the values to have the same shape, which is not the case here because I am working with whole rows.

For example, if

A = np.array([[1., 2., 3., 4.],
              [2., 3., 4., 5.],
              [9., 8., 7., 6.],
              [8., 7., 6., 5.]])

and

bins = np.array([0, 1, 0, 2])

then it should result in

S = np.array([[10., 10., 10., 10.],
              [2.,  3.,  4.,  5. ],
              [8.,  7.,  6.,  5. ]])
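For reference, the np.stack and list-comprehension baseline mentioned above might look something like the sketch below. The helper name sum_by_bin_stack is hypothetical, and the code assumes the bin labels are non-negative integers running from 0 to bins.max().

```python
import numpy as np

def sum_by_bin_stack(A, bins):
    """Hypothetical helper: sum the rows of A that share a bin label."""
    n_bins = bins.max() + 1  # assumes labels are integers 0..n_bins-1
    # One boolean-indexed sum per bin, stacked into a (n_bins, n_cols) array.
    return np.stack([A[bins == b].sum(axis=0) for b in range(n_bins)])

A = np.array([[1., 2., 3., 4.],
              [2., 3., 4., 5.],
              [9., 8., 7., 6.],
              [8., 7., 6., 5.]])
bins = np.array([0, 1, 0, 2])

S = sum_by_bin_stack(A, bins)
print(S)
```

A bin with no rows assigned to it yields a row of zeros here, since summing an empty selection along axis 0 gives zeros.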

Solution

Here's an approach using matrix multiplication with np.dot: build a boolean mask with one row per bin, then multiply it by A -

(bins == np.arange(bins.max()+1)[:,None]).dot(A)
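To see why the one-liner works, it can help to look at the intermediate mask on its own. The sketch below splits the expression into steps; the names labels and mask are introduced here only for illustration.

```python
import numpy as np

A = np.array([[1., 2., 3., 4.],
              [2., 3., 4., 5.],
              [9., 8., 7., 6.],
              [8., 7., 6., 5.]])
bins = np.array([0, 1, 0, 2])

labels = np.arange(bins.max() + 1)   # one label per bin: [0, 1, 2]

# Broadcasting a (3, 1) column against the (4,) bins array gives a (3, 4)
# boolean mask: mask[b, i] is True when row i of A belongs to bin b.
mask = bins == labels[:, None]
print(mask.astype(int))

# In the matrix product, row b of the result is the sum of the rows of A
# selected by row b of the mask.
S = mask.dot(A)
print(S)
```

Each column of the mask contains exactly one True, because every row of A belongs to exactly one bin.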

Sample run -

In [40]: A = np.array([[1., 2., 3., 4.],
    ...:               [2., 3., 4., 5.],
    ...:               [9., 8., 7., 6.],
    ...:               [8., 7., 6., 5.]])

In [41]: bins = np.array([0, 1, 0, 2])

In [42]: (bins == np.arange(bins.max()+1)[:,None]).dot(A)
Out[42]: 
array([[ 10.,  10.,  10.,  10.],
       [  2.,   3.,   4.,   5.],
       [  8.,   7.,   6.,   5.]])
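The dot-product trick is specific to sums. For the "general function" part of the question, one alternative (a different technique from the one above, not part of it) is the unbuffered ufunc.at method, which works for any binary ufunc such as np.add or np.maximum. This is only a sketch: the names S_sum and S_max are illustrative, and the starting values (0 for a sum, -inf for a maximum) are assumptions that have to match the reduction being applied.

```python
import numpy as np

A = np.array([[1., 2., 3., 4.],
              [2., 3., 4., 5.],
              [9., 8., 7., 6.],
              [8., 7., 6., 5.]])
bins = np.array([0, 1, 0, 2])
n_bins = bins.max() + 1

# Sum per bin: start from zeros and accumulate row i of A into row bins[i].
S_sum = np.zeros((n_bins, A.shape[1]))
np.add.at(S_sum, bins, A)

# Maximum per bin: start from -inf so any real value replaces it.
S_max = np.full((n_bins, A.shape[1]), -np.inf)
np.maximum.at(S_max, bins, A)

print(S_sum)
print(S_max)
```

Unlike a plain S_sum[bins] += A, the .at form accumulates correctly when the same bin index appears more than once.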

Performance boost

A more efficient way to create the mask (bins == np.arange(bins.max()+1)[:,None]) is to preallocate it and set the True entries by index -

mask = np.zeros((bins.max()+1, len(bins)), dtype=bool)
mask[bins, np.arange(len(bins))] = True  # mask[b, i] is True when row i of A is in bin b
S = mask.dot(A)
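As a sanity check, the indexed mask can be compared against the broadcast version and wrapped in a small function. The name sum_by_bin is hypothetical, and the sketch assumes non-negative integer bin labels.

```python
import numpy as np

def sum_by_bin(A, bins):
    """Hypothetical helper: per-bin row sums via an indexed mask and np.dot."""
    n_bins = bins.max() + 1
    mask = np.zeros((n_bins, len(bins)), dtype=bool)
    # Set exactly one entry per column: row = bin label, column = row index of A.
    mask[bins, np.arange(len(bins))] = True
    return mask, mask.dot(A)

A = np.array([[1., 2., 3., 4.],
              [2., 3., 4., 5.],
              [9., 8., 7., 6.],
              [8., 7., 6., 5.]])
bins = np.array([0, 1, 0, 2])

mask, S = sum_by_bin(A, bins)
mask_broadcast = bins == np.arange(bins.max() + 1)[:, None]

# Both constructions should produce the same boolean array.
same = np.array_equal(mask, mask_broadcast)
print(same)
print(S)
```

Either way the mask has one entry per (bin, row) pair, so its memory use grows with the number of bins times the number of rows of A.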