Search code examples
pythonpandasdataframegroup-by

how to sum a groupby in pandas with repeating data, while preserving other columns


I have a data source that looks approximately like this:

      location  available  sold     name  local_id    more_data
0   1001 - BBB          1     0    Alpha        24   'DJQ3DD3y'
1   1001 - BBB          1     0    Alpha        24   'aB3joXQy'
2   1001 - BBB          1     0    Alpha        24   'ZJve572B'
3   1001 - BBB          1     0    Alpha        24   'DJEkx8Dy'
4   1001 - BBB          1     0    Alpha        24   'VyVaWLYp'
5   1001 - BBB          1     0    Bravo        19   'Rpr7AvVy'
6   1001 - BBB          1     0    Bravo        19   'ZJlO0VmB'
7   1001 - BBB          1     0    Bravo        19   'OBb6NrrB'
8   1001 - BBB          1     0    Bravo        19   'ZJvPPEXy'
9   1001 - BBB          1     0  Charlie         6   'Vy9MMOEy'
10  1001 - BBB          1     0  Charlie         6   'MJ8AALKp'
11  1001 - BBB          1     0    Delta        17   'vpWmN1kB'
12  1001 - BBB          1     0    Delta        17   'DJEb9qQp'
13  1001 - BBB          1     0     Echo         7   'qyZ1zn1p'
14  1001 - BBB          1     0     Echo         7   'bBqaYoMB'
15  1001 - BBB          1     0     Golf        22   'AJgLr9qp'
16  1001 - BBB          1     0     Golf        22   'vBdV57Ap'
17  1001 - BBB          1     0     Golf        22   'VJYxnLZB'
18   1001 - GG       1029   237  Charlie         6   'VJYxnGXB'
19   1001 - GG       1029   237  Charlie         6   'Vy9Mo52y'
20   1001 - GG       1029   237    Delta        17   'aB3zxYWy'
21   1001 - GG       1029   237    Delta        17   'MJ8A3z0p'
22   1001 - GG       1029   237     Echo         7   'YpLMPwNy'
23   1001 - GG       1029   237     Echo         7   '8Bwev1ep'
24   1001 - GG       1029   237     Golf        22   'MJXm6bLp'
25   1001 - GG       1029   237     Golf        22   'oye7XR0J'
26   1001 - GG       1029   237     Golf        22   'vpWmDDYB'
27    1001 - P        873   375  Charlie         6   'DJEbjjkp'
28    1001 - P        873   375  Charlie         6   'aB3z66zy'
29    1001 - P        873   375    Delta        17   'Kp4zrrKB'
30    1001 - P        873   375    Delta        17   'oyxqMMAB'
31    1001 - P        873   375     Echo         7   'zJ1KMMZy'
32    1001 - P        873   375     Echo         7   'ZJlOzz6B'
33    1001 - P        873   375  Foxtrot        20   'YpLMbbay'
34    1001 - P        873   375  Foxtrot        20   'ZJnmzzYB'
35    1001 - P        873   375     Golf        22   'Kp4zr5LB'
36    1001 - P        873   375     Golf        22   'oye7jg8J'
37    1001 - P        873   375     Golf        22   'OBb6jE3B'
38   1002 - GG         37    11  Charlie         6   'EyGMWPbJ'
39   1002 - GG         37    11  Charlie         6   'aB3zOoDy'
40   1002 - GG         37    11    Delta        17   'DJQ4laLB'
41   1002 - GG         37    11    Delta        17   'ZJlOvNXB'
42   1002 - GG         37    11     Echo         7   'Rpr7a8Dy'
43   1002 - GG         37    11     Echo         7   'zJjYNR4B'
44   1002 - GG         37    11     Golf        22   'Vy9MqkRy'
45   1002 - GG         37    11     Golf        22   'oye7Y0kJ'
46   1002 - GG         37    11     Golf        22   '8BweZbnp'
47    1002 - P       1854   826  Charlie         6   'Rpr7Z5by'
48    1002 - P       1854   826  Charlie         6   'vBdVK1Ap'
49    1002 - P       1854   826    Delta        17   '4Jkae8Dy'
50    1002 - P       1854   826    Delta        17   'YpLM3nxy'
51    1002 - P       1854   826     Echo         7   'VB7vD6Py'
52    1002 - P       1854   826     Echo         7   'ZJlOXbzB'
53    1002 - P       1854   826  Foxtrot        20   'MpNqezKJ'
54    1002 - P       1854   826  Foxtrot        20   '9pOWo39p'
55    1002 - P       1854   826     Golf        22   'MJXm5qnp'
56    1002 - P       1854   826     Golf        22   'oy5vxd4B'
57    1002 - P       1854   826     Golf        22   'DJQ4qz3B'

As you can see, the data in the available and sold columns are presented in relation to the location column, and related to the unique combination of numbers and letters in that column. What I am trying to do is group these by the first portion of the location column (i.e., the "1001" portion of "1001 - XXX") and sum the unique values for available and sold, while preserving the other data which is unique per row and includes numeric types that are keys and I do not want altered.

Therefore, the output would look like the following:

    location  available  sold     name  local_id    more_data
0       1001       1903   612    Alpha        24   'DJQ3DD3y'
1       1001       1903   612    Alpha        24   'aB3joXQy'
2       1001       1903   612    Alpha        24   'ZJve572B'
3       1001       1903   612    Alpha        24   'DJEkx8Dy'
4       1001       1903   612    Alpha        24   'VyVaWLYp'
5       1001       1903   612    Bravo        19   'Rpr7AvVy'
6       1001       1903   612    Bravo        19   'ZJlO0VmB'
7       1001       1903   612    Bravo        19   'OBb6NrrB'
8       1001       1903   612    Bravo        19   'ZJvPPEXy'
9       1001       1903   612  Charlie         6   'Vy9MMOEy'
10      1001       1903   612  Charlie         6   'MJ8AALKp'
11      1001       1903   612    Delta        17   'vpWmN1kB'
12      1001       1903   612    Delta        17   'DJEb9qQp'
13      1001       1903   612     Echo         7   'qyZ1zn1p'
14      1001       1903   612     Echo         7   'bBqaYoMB'
15      1001       1903   612     Golf        22   'AJgLr9qp'
16      1001       1903   612     Golf        22   'vBdV57Ap'
17      1001       1903   612     Golf        22   'VJYxnLZB'
18      1001       1903   612  Charlie         6   'VJYxnGXB'
19      1001       1903   612  Charlie         6   'Vy9Mo52y'
20      1001       1903   612    Delta        17   'aB3zxYWy'
21      1001       1903   612    Delta        17   'MJ8A3z0p'
22      1001       1903   612     Echo         7   'YpLMPwNy'
23      1001       1903   612     Echo         7   '8Bwev1ep'
24      1001       1903   612     Golf        22   'MJXm6bLp'
25      1001       1903   612     Golf        22   'oye7XR0J'
26      1001       1903   612     Golf        22   'vpWmDDYB'
27      1001       1903   612  Charlie         6   'DJEbjjkp'
28      1001       1903   612  Charlie         6   'aB3z66zy'
29      1001       1903   612    Delta        17   'Kp4zrrKB'
30      1001       1903   612    Delta        17   'oyxqMMAB'
31      1001       1903   612     Echo         7   'zJ1KMMZy'
32      1001       1903   612     Echo         7   'ZJlOzz6B'
33      1001       1903   612  Foxtrot        20   'YpLMbbay'
34      1001       1903   612  Foxtrot        20   'ZJnmzzYB'
35      1001       1903   612     Golf        22   'Kp4zr5LB'
36      1001       1903   612     Golf        22   'oye7jg8J'
37      1001       1903   612     Golf        22   'OBb6jE3B'
38      1002       1891   837  Charlie         6   'EyGMWPbJ'
39      1002       1891   837  Charlie         6   'aB3zOoDy'
40      1002       1891   837    Delta        17   'DJQ4laLB'
41      1002       1891   837    Delta        17   'ZJlOvNXB'
42      1002       1891   837     Echo         7   'Rpr7a8Dy'
43      1002       1891   837     Echo         7   'zJjYNR4B'
44      1002       1891   837     Golf        22   'Vy9MqkRy'
45      1002       1891   837     Golf        22   'oye7Y0kJ'
46      1002       1891   837     Golf        22   '8BweZbnp'
47      1002       1891   837  Charlie         6   'Rpr7Z5by'
48      1002       1891   837  Charlie         6   'vBdVK1Ap'
49      1002       1891   837    Delta        17   '4Jkae8Dy'
50      1002       1891   837    Delta        17   'YpLM3nxy'
51      1002       1891   837     Echo         7   'VB7vD6Py'
52      1002       1891   837     Echo         7   'ZJlOXbzB'
53      1002       1891   837  Foxtrot        20   'MpNqezKJ'
54      1002       1891   837  Foxtrot        20   '9pOWo39p'
55      1002       1891   837     Golf        22   'MJXm5qnp'
56      1002       1891   837     Golf        22   'oy5vxd4B'
57      1002       1891   837     Golf        22   'DJQ4qz3B'

I know that I'm missing something super-easy because this kind of problem is in panda's wheelhouse. But sadly, it's where I am on the learning curve; I hope someone can steer me in the right direction.


Solution

  • You can extract the leading part in location, then compute a groupby.sum on the unique rows (obtained with drop_duplicates) and reindex to the original shape:

    df['location2'] = df['location'].str.extract(r'(\d+)')
    df[['available', 'sold']] = (df.drop_duplicates('location')
                                   .groupby('location2')[['available', 'sold']].sum()
                                   .reindex(df['location2']).to_numpy()
                                )
    df['location'] = df.pop('location2')
    

    Output:

       location  available  sold     name  local_id   more_data
    0      1001       1903   612    Alpha        24  'DJQ3DD3y'
    1      1001       1903   612    Alpha        24  'aB3joXQy'
    2      1001       1903   612    Alpha        24  'ZJve572B'
    3      1001       1903   612    Alpha        24  'DJEkx8Dy'
    4      1001       1903   612    Alpha        24  'VyVaWLYp'
    5      1001       1903   612    Bravo        19  'Rpr7AvVy'
    6      1001       1903   612    Bravo        19  'ZJlO0VmB'
    7      1001       1903   612    Bravo        19  'OBb6NrrB'
    8      1001       1903   612    Bravo        19  'ZJvPPEXy'
    9      1001       1903   612  Charlie         6  'Vy9MMOEy'
    10     1001       1903   612  Charlie         6  'MJ8AALKp'
    11     1001       1903   612    Delta        17  'vpWmN1kB'
    12     1001       1903   612    Delta        17  'DJEb9qQp'
    13     1001       1903   612     Echo         7  'qyZ1zn1p'
    14     1001       1903   612     Echo         7  'bBqaYoMB'
    15     1001       1903   612     Golf        22  'AJgLr9qp'
    16     1001       1903   612     Golf        22  'vBdV57Ap'
    17     1001       1903   612     Golf        22  'VJYxnLZB'
    18     1001       1903   612  Charlie         6  'VJYxnGXB'
    19     1001       1903   612  Charlie         6  'Vy9Mo52y'
    20     1001       1903   612    Delta        17  'aB3zxYWy'
    21     1001       1903   612    Delta        17  'MJ8A3z0p'
    22     1001       1903   612     Echo         7  'YpLMPwNy'
    23     1001       1903   612     Echo         7  '8Bwev1ep'
    24     1001       1903   612     Golf        22  'MJXm6bLp'
    25     1001       1903   612     Golf        22  'oye7XR0J'
    26     1001       1903   612     Golf        22  'vpWmDDYB'
    27     1001       1903   612  Charlie         6  'DJEbjjkp'
    28     1001       1903   612  Charlie         6  'aB3z66zy'
    29     1001       1903   612    Delta        17  'Kp4zrrKB'
    30     1001       1903   612    Delta        17  'oyxqMMAB'
    31     1001       1903   612     Echo         7  'zJ1KMMZy'
    32     1001       1903   612     Echo         7  'ZJlOzz6B'
    33     1001       1903   612  Foxtrot        20  'YpLMbbay'
    34     1001       1903   612  Foxtrot        20  'ZJnmzzYB'
    35     1001       1903   612     Golf        22  'Kp4zr5LB'
    36     1001       1903   612     Golf        22  'oye7jg8J'
    37     1001       1903   612     Golf        22  'OBb6jE3B'
    38     1002       1891   837  Charlie         6  'EyGMWPbJ'
    39     1002       1891   837  Charlie         6  'aB3zOoDy'
    40     1002       1891   837    Delta        17  'DJQ4laLB'
    41     1002       1891   837    Delta        17  'ZJlOvNXB'
    42     1002       1891   837     Echo         7  'Rpr7a8Dy'
    43     1002       1891   837     Echo         7  'zJjYNR4B'
    44     1002       1891   837     Golf        22  'Vy9MqkRy'
    45     1002       1891   837     Golf        22  'oye7Y0kJ'
    46     1002       1891   837     Golf        22  '8BweZbnp'
    47     1002       1891   837  Charlie         6  'Rpr7Z5by'
    48     1002       1891   837  Charlie         6  'vBdVK1Ap'
    49     1002       1891   837    Delta        17  '4Jkae8Dy'
    50     1002       1891   837    Delta        17  'YpLM3nxy'
    51     1002       1891   837     Echo         7  'VB7vD6Py'
    52     1002       1891   837     Echo         7  'ZJlOXbzB'
    53     1002       1891   837  Foxtrot        20  'MpNqezKJ'
    54     1002       1891   837  Foxtrot        20  '9pOWo39p'
    55     1002       1891   837     Golf        22  'MJXm5qnp'
    56     1002       1891   837     Golf        22  'oy5vxd4B'
    57     1002       1891   837     Golf        22  'DJQ4qz3B'