I've see a dozen Pandas groupby multiple columns questions, but I'm at a loss on how to get this to run in a reasonable time. My goal is to groupby a few columns, and with the resulting subset apply np.dot
across each remaining column against my weights:
# Example data:
weights = np.array([.20, .60, .20])
data = pd.DataFrame([[0, "TX", 10, 55], [0, "TX", 5, 30], [0, "TX", 2, 75], [1, "TX", 4, 30], [1, "TX", 8, 100], [1, "TX", 2, 30]], columns=["sim", "state", "x1", "x2"])
print(data)
sim state x1 x2
0 0 TX 10 55
1 0 TX 5 30
2 0 TX 2 75
3 1 TX 4 30
4 1 TX 8 100
5 1 TX 2 30
I couldn't get np.dot
to work out of the box, so I had to break the multiplication and summation into separate steps. Here's what i've tried, but on my dataset of a few million rows this takes ~2 minutes, not to mention being pretty unreadable:
results = data.groupby(["sim", "state"]).apply(lambda sdf: (sdf[["x1", "x2"]] * weights.reshape((3,1))).sum())
print(results.reset_index())
sim state x1 x2
0 0 TX 5.4 44.0
1 1 TX 6.0 72.0
How about...
(df.set_index(['sim', 'state'])
.mul(np.tile(weights, len(df) // len(weights)), axis=0)
.sum(level=[0, 1]))
x1 x2
sim state
0 TX 5.4 44.0
1 TX 6.0 72.0
How this works,
df
's primary keys, essentially)mul
to perform broadcasted multiplication with the weights This works under the assumption that len(df) % len(weights) == 0
.