python pandas numpy group-by pandas-groupby

Pandas groupby and weighted sum for multiple columns

I've see a dozen Pandas groupby multiple columns questions, but I'm at a loss on how to get this to run in a reasonable time. My goal is to groupby a few columns, and with the resulting subset apply np.dot across each remaining column against my weights:

# Example data:
weights = np.array([.20, .60, .20])
data = pd.DataFrame([[0, "TX", 10, 55], [0, "TX", 5, 30], [0, "TX", 2, 75], [1, "TX", 4, 30], [1, "TX", 8, 100], [1, "TX", 2, 30]], columns=["sim", "state", "x1", "x2"])

print(data)
   sim state  x1   x2
0    0    TX  10   55
1    0    TX   5   30
2    0    TX   2   75
3    1    TX   4   30
4    1    TX   8  100
5    1    TX   2   30

I couldn't get np.dot to work out of the box, so I had to break the multiplication and summation into separate steps. Here's what i've tried, but on my dataset of a few million rows this takes ~2 minutes, not to mention being pretty unreadable:

results = data.groupby(["sim", "state"]).apply(lambda sdf: (sdf[["x1", "x2"]] * weights.reshape((3,1))).sum())

print(results.reset_index())
   sim state   x1    x2
0    0    TX  5.4  44.0
1    1    TX  6.0  72.0

Solution

How about...

(df.set_index(['sim', 'state'])
   .mul(np.tile(weights, len(df) // len(weights)), axis=0)
   .sum(level=[0, 1]))

            x1    x2
sim state           
0   TX     5.4  44.0
1   TX     6.0  72.0

How this works,

set the index to whatever should not be multiplied (df's primary keys, essentially)
use mul to perform broadcasted multiplication with the weights
group on the indices and sum the weighted values.

This works under the assumption that len(df) % len(weights) == 0.