Imagine we have a DataFrame like this one:
df = pd.DataFrame(np.array([[284.77, 234.37, 243.8, 84.36, 0., 0., 0., 55.04, 228.2, 181.97, 0., 0.],
[13.78, 0., 38.58, 33.16, 0., 38.04, 74.02, 45.74, 27.2, 9.19, 0., 0.],
[88.66, 255.72, 323.19, 7.24, 0., 73.38, 45.73, 0., 0., 77.39, 26.57, 279.34],
[0., 0., 34.42, 9.16, 0., 43.4, 42.17, 123.69, 60.5, 25.47, 72.32, 7.29],
[320.6, 1445.56, 856.23, 371.21, 0., 244.22, 134.58, 631.59, 561.82, 1172.44, 895.68, 186.28],
[0., 0., 32.29, 1000.91, 0., 680., 585.46, 466.6, 0., 493.48, 157.1, 125.31]]),
columns=[1,2,3,4,5,6,7,8,9,10,11,12])
df['Lists_to_sum'] = [[1,2,3,4],
[4,6,8,9,10,11],
[2],
[3,4,5,6,7,8,9,10,11],
[1,2,3,4,5,6,7,8,9],
[2,3,4,5,6,7,8,9,10,11,12],]
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | Lists_to_sum | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 284.77 | 234.37 | 243.80 | 84.36 | 0.00 | 0.00 | 0.00 | 55.04 | 228.20 | 181.97 | 0.00 | 0.00 | [1, 2, 3, 4] |
1 | 13.78 | 0.00 | 38.58 | 33.16 | 0.00 | 38.04 | 74.02 | 45.74 | 27.20 | 9.19 | 0.00 | 0.00 | [4, 6, 8, 9, 10, 11] |
2 | 88.66 | 255.72 | 323.19 | 7.24 | 0.00 | 73.38 | 45.73 | 0.00 | 0.00 | 77.39 | 26.57 | 279.34 | [2] |
3 | 0.00 | 0.00 | 34.42 | 9.16 | 0.00 | 43.40 | 42.17 | 123.69 | 60.50 | 25.47 | 72.32 | 7.29 | [3, 4, 5, 6, 7, 8, 9, 10, 11] |
4 | 320.60 | 1445.56 | 856.23 | 371.21 | 0.00 | 244.22 | 134.58 | 631.59 | 561.82 | 1172.44 | 895.68 | 186.28 | [1, 2, 3, 4, 5, 6, 7, 8, 9] |
5 | 0.00 | 0.00 | 32.29 | 1000.91 | 0.00 | 680.00 | 585.46 | 466.60 | 0.00 | 493.48 | 157.10 | 125.31 | [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] |
Our goal is to add a column with the sum of each row, but there's a catch: we just want to add up the columns that are in the lists of each row, presented in the 'Lists_to_sum' column, that is, each row has a different set of columns to be added
The final DataFrame should look like this:
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | Lists_to_sum | SUM_per_ROW | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 284.77 | 234.37 | 243.80 | 84.36 | 0.00 | 0.00 | 0.00 | 55.04 | 228.20 | 181.97 | 0.00 | 0.00 | [1, 2, 3, 4] | 847.30 |
1 | 13.78 | 0.00 | 38.58 | 33.16 | 0.00 | 38.04 | 74.02 | 45.74 | 27.20 | 9.19 | 0.00 | 0.00 | [4, 6, 8, 9, 10, 11] | 153.33 |
2 | 88.66 | 255.72 | 323.19 | 7.24 | 0.00 | 73.38 | 45.73 | 0.00 | 0.00 | 77.39 | 26.57 | 279.34 | [2] | 255.72 |
3 | 0.00 | 0.00 | 34.42 | 9.16 | 0.00 | 43.40 | 42.17 | 123.69 | 60.50 | 25.47 | 72.32 | 7.29 | [3, 4, 5, 6, 7, 8, 9, 10, 11] | 411.13 |
4 | 320.60 | 1445.56 | 856.23 | 371.21 | 0.00 | 244.22 | 134.58 | 631.59 | 561.82 | 1172.44 | 895.68 | 186.28 | [1, 2, 3, 4, 5, 6, 7, 8, 9] | 4565.81 |
5 | 0.00 | 0.00 | 32.29 | 1000.91 | 0.00 | 680.00 | 585.46 | 466.60 | 0.00 | 493.48 | 157.10 | 125.31 | [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] | 3541.15 |
Did I mention there was a catch? There are actually two: we need vectorized performance for the solution.
I tried to solve this problem with a naïve approach, which worked to some extent: it gives me the results I want. But it performs the worst, since the code iterates over each row. For a tiny DataFrame like the one in the example, that's fine, but implementing this solution on a huge DataFrame with tens of millions of rows, well, that's another story.
for n,m,i in zip(df.index, df['Lists_to_sum'], range(0, df.shape[0])):
df.at[n,'SUM_per_ROW'] = df[m][i:i+1].sum(axis=1)
print('{} de {}'.format(i+1, df.shape[0]))
So, I wonder if there's a better way to solve this problem... Can you help me?
Code
make condition by explode
and get_dummies
(and groupby, any) and boolean masking (and sum)
cond = pd.get_dummies(df['Lists_to_sum'].explode()).groupby(level=0).any()
df['SUM_per_ROW'] = df[cond].sum(axis=1)
df