I have a data source that looks approximately like this:
location available sold name local_id more_data
0 1001 - BBB 1 0 Alpha 24 'DJQ3DD3y'
1 1001 - BBB 1 0 Alpha 24 'aB3joXQy'
2 1001 - BBB 1 0 Alpha 24 'ZJve572B'
3 1001 - BBB 1 0 Alpha 24 'DJEkx8Dy'
4 1001 - BBB 1 0 Alpha 24 'VyVaWLYp'
5 1001 - BBB 1 0 Bravo 19 'Rpr7AvVy'
6 1001 - BBB 1 0 Bravo 19 'ZJlO0VmB'
7 1001 - BBB 1 0 Bravo 19 'OBb6NrrB'
8 1001 - BBB 1 0 Bravo 19 'ZJvPPEXy'
9 1001 - BBB 1 0 Charlie 6 'Vy9MMOEy'
10 1001 - BBB 1 0 Charlie 6 'MJ8AALKp'
11 1001 - BBB 1 0 Delta 17 'vpWmN1kB'
12 1001 - BBB 1 0 Delta 17 'DJEb9qQp'
13 1001 - BBB 1 0 Echo 7 'qyZ1zn1p'
14 1001 - BBB 1 0 Echo 7 'bBqaYoMB'
15 1001 - BBB 1 0 Golf 22 'AJgLr9qp'
16 1001 - BBB 1 0 Golf 22 'vBdV57Ap'
17 1001 - BBB 1 0 Golf 22 'VJYxnLZB'
18 1001 - GG 1029 237 Charlie 6 'VJYxnGXB'
19 1001 - GG 1029 237 Charlie 6 'Vy9Mo52y'
20 1001 - GG 1029 237 Delta 17 'aB3zxYWy'
21 1001 - GG 1029 237 Delta 17 'MJ8A3z0p'
22 1001 - GG 1029 237 Echo 7 'YpLMPwNy'
23 1001 - GG 1029 237 Echo 7 '8Bwev1ep'
24 1001 - GG 1029 237 Golf 22 'MJXm6bLp'
25 1001 - GG 1029 237 Golf 22 'oye7XR0J'
26 1001 - GG 1029 237 Golf 22 'vpWmDDYB'
27 1001 - P 873 375 Charlie 6 'DJEbjjkp'
28 1001 - P 873 375 Charlie 6 'aB3z66zy'
29 1001 - P 873 375 Delta 17 'Kp4zrrKB'
30 1001 - P 873 375 Delta 17 'oyxqMMAB'
31 1001 - P 873 375 Echo 7 'zJ1KMMZy'
32 1001 - P 873 375 Echo 7 'ZJlOzz6B'
33 1001 - P 873 375 Foxtrot 20 'YpLMbbay'
34 1001 - P 873 375 Foxtrot 20 'ZJnmzzYB'
35 1001 - P 873 375 Golf 22 'Kp4zr5LB'
36 1001 - P 873 375 Golf 22 'oye7jg8J'
37 1001 - P 873 375 Golf 22 'OBb6jE3B'
38 1002 - GG 37 11 Charlie 6 'EyGMWPbJ'
39 1002 - GG 37 11 Charlie 6 'aB3zOoDy'
40 1002 - GG 37 11 Delta 17 'DJQ4laLB'
41 1002 - GG 37 11 Delta 17 'ZJlOvNXB'
42 1002 - GG 37 11 Echo 7 'Rpr7a8Dy'
43 1002 - GG 37 11 Echo 7 'zJjYNR4B'
44 1002 - GG 37 11 Golf 22 'Vy9MqkRy'
45 1002 - GG 37 11 Golf 22 'oye7Y0kJ'
46 1002 - GG 37 11 Golf 22 '8BweZbnp'
47 1002 - P 1854 826 Charlie 6 'Rpr7Z5by'
48 1002 - P 1854 826 Charlie 6 'vBdVK1Ap'
49 1002 - P 1854 826 Delta 17 '4Jkae8Dy'
50 1002 - P 1854 826 Delta 17 'YpLM3nxy'
51 1002 - P 1854 826 Echo 7 'VB7vD6Py'
52 1002 - P 1854 826 Echo 7 'ZJlOXbzB'
53 1002 - P 1854 826 Foxtrot 20 'MpNqezKJ'
54 1002 - P 1854 826 Foxtrot 20 '9pOWo39p'
55 1002 - P 1854 826 Golf 22 'MJXm5qnp'
56 1002 - P 1854 826 Golf 22 'oy5vxd4B'
57 1002 - P 1854 826 Golf 22 'DJQ4qz3B'
As you can see, the data in the available
and sold
columns are presented in relation to the location
column, and related to the unique combination of numbers and letters in that column. What I am trying to do is group these by the first portion of the location
column (i.e., the "1001" portion of "1001 - XXX") and sum the unique values for available
and sold
, while preserving the other data which is unique per row and includes numeric types that are keys and I do not want altered.
Therefore, the output would look like the following:
location available sold name local_id more_data
0 1001 1903 612 Alpha 24 'DJQ3DD3y'
1 1001 1903 612 Alpha 24 'aB3joXQy'
2 1001 1903 612 Alpha 24 'ZJve572B'
3 1001 1903 612 Alpha 24 'DJEkx8Dy'
4 1001 1903 612 Alpha 24 'VyVaWLYp'
5 1001 1903 612 Bravo 19 'Rpr7AvVy'
6 1001 1903 612 Bravo 19 'ZJlO0VmB'
7 1001 1903 612 Bravo 19 'OBb6NrrB'
8 1001 1903 612 Bravo 19 'ZJvPPEXy'
9 1001 1903 612 Charlie 6 'Vy9MMOEy'
10 1001 1903 612 Charlie 6 'MJ8AALKp'
11 1001 1903 612 Delta 17 'vpWmN1kB'
12 1001 1903 612 Delta 17 'DJEb9qQp'
13 1001 1903 612 Echo 7 'qyZ1zn1p'
14 1001 1903 612 Echo 7 'bBqaYoMB'
15 1001 1903 612 Golf 22 'AJgLr9qp'
16 1001 1903 612 Golf 22 'vBdV57Ap'
17 1001 1903 612 Golf 22 'VJYxnLZB'
18 1001 1903 612 Charlie 6 'VJYxnGXB'
19 1001 1903 612 Charlie 6 'Vy9Mo52y'
20 1001 1903 612 Delta 17 'aB3zxYWy'
21 1001 1903 612 Delta 17 'MJ8A3z0p'
22 1001 1903 612 Echo 7 'YpLMPwNy'
23 1001 1903 612 Echo 7 '8Bwev1ep'
24 1001 1903 612 Golf 22 'MJXm6bLp'
25 1001 1903 612 Golf 22 'oye7XR0J'
26 1001 1903 612 Golf 22 'vpWmDDYB'
27 1001 1903 612 Charlie 6 'DJEbjjkp'
28 1001 1903 612 Charlie 6 'aB3z66zy'
29 1001 1903 612 Delta 17 'Kp4zrrKB'
30 1001 1903 612 Delta 17 'oyxqMMAB'
31 1001 1903 612 Echo 7 'zJ1KMMZy'
32 1001 1903 612 Echo 7 'ZJlOzz6B'
33 1001 1903 612 Foxtrot 20 'YpLMbbay'
34 1001 1903 612 Foxtrot 20 'ZJnmzzYB'
35 1001 1903 612 Golf 22 'Kp4zr5LB'
36 1001 1903 612 Golf 22 'oye7jg8J'
37 1001 1903 612 Golf 22 'OBb6jE3B'
38 1002 1891 837 Charlie 6 'EyGMWPbJ'
39 1002 1891 837 Charlie 6 'aB3zOoDy'
40 1002 1891 837 Delta 17 'DJQ4laLB'
41 1002 1891 837 Delta 17 'ZJlOvNXB'
42 1002 1891 837 Echo 7 'Rpr7a8Dy'
43 1002 1891 837 Echo 7 'zJjYNR4B'
44 1002 1891 837 Golf 22 'Vy9MqkRy'
45 1002 1891 837 Golf 22 'oye7Y0kJ'
46 1002 1891 837 Golf 22 '8BweZbnp'
47 1002 1891 837 Charlie 6 'Rpr7Z5by'
48 1002 1891 837 Charlie 6 'vBdVK1Ap'
49 1002 1891 837 Delta 17 '4Jkae8Dy'
50 1002 1891 837 Delta 17 'YpLM3nxy'
51 1002 1891 837 Echo 7 'VB7vD6Py'
52 1002 1891 837 Echo 7 'ZJlOXbzB'
53 1002 1891 837 Foxtrot 20 'MpNqezKJ'
54 1002 1891 837 Foxtrot 20 '9pOWo39p'
55 1002 1891 837 Golf 22 'MJXm5qnp'
56 1002 1891 837 Golf 22 'oy5vxd4B'
57 1002 1891 837 Golf 22 'DJQ4qz3B'
I know that I'm missing something super-easy because this kind of problem is in panda's wheelhouse. But sadly, it's where I am on the learning curve; I hope someone can steer me in the right direction.
You can extract
the leading part in location, then compute a groupby.sum
on the unique rows (obtained with drop_duplicates
) and reindex
to the original shape:
df['location2'] = df['location'].str.extract(r'(\d+)')
df[['available', 'sold']] = (df.drop_duplicates('location')
.groupby('location2')[['available', 'sold']].sum()
.reindex(df['location2']).to_numpy()
)
df['location'] = df.pop('location2')
Output:
location available sold name local_id more_data
0 1001 1903 612 Alpha 24 'DJQ3DD3y'
1 1001 1903 612 Alpha 24 'aB3joXQy'
2 1001 1903 612 Alpha 24 'ZJve572B'
3 1001 1903 612 Alpha 24 'DJEkx8Dy'
4 1001 1903 612 Alpha 24 'VyVaWLYp'
5 1001 1903 612 Bravo 19 'Rpr7AvVy'
6 1001 1903 612 Bravo 19 'ZJlO0VmB'
7 1001 1903 612 Bravo 19 'OBb6NrrB'
8 1001 1903 612 Bravo 19 'ZJvPPEXy'
9 1001 1903 612 Charlie 6 'Vy9MMOEy'
10 1001 1903 612 Charlie 6 'MJ8AALKp'
11 1001 1903 612 Delta 17 'vpWmN1kB'
12 1001 1903 612 Delta 17 'DJEb9qQp'
13 1001 1903 612 Echo 7 'qyZ1zn1p'
14 1001 1903 612 Echo 7 'bBqaYoMB'
15 1001 1903 612 Golf 22 'AJgLr9qp'
16 1001 1903 612 Golf 22 'vBdV57Ap'
17 1001 1903 612 Golf 22 'VJYxnLZB'
18 1001 1903 612 Charlie 6 'VJYxnGXB'
19 1001 1903 612 Charlie 6 'Vy9Mo52y'
20 1001 1903 612 Delta 17 'aB3zxYWy'
21 1001 1903 612 Delta 17 'MJ8A3z0p'
22 1001 1903 612 Echo 7 'YpLMPwNy'
23 1001 1903 612 Echo 7 '8Bwev1ep'
24 1001 1903 612 Golf 22 'MJXm6bLp'
25 1001 1903 612 Golf 22 'oye7XR0J'
26 1001 1903 612 Golf 22 'vpWmDDYB'
27 1001 1903 612 Charlie 6 'DJEbjjkp'
28 1001 1903 612 Charlie 6 'aB3z66zy'
29 1001 1903 612 Delta 17 'Kp4zrrKB'
30 1001 1903 612 Delta 17 'oyxqMMAB'
31 1001 1903 612 Echo 7 'zJ1KMMZy'
32 1001 1903 612 Echo 7 'ZJlOzz6B'
33 1001 1903 612 Foxtrot 20 'YpLMbbay'
34 1001 1903 612 Foxtrot 20 'ZJnmzzYB'
35 1001 1903 612 Golf 22 'Kp4zr5LB'
36 1001 1903 612 Golf 22 'oye7jg8J'
37 1001 1903 612 Golf 22 'OBb6jE3B'
38 1002 1891 837 Charlie 6 'EyGMWPbJ'
39 1002 1891 837 Charlie 6 'aB3zOoDy'
40 1002 1891 837 Delta 17 'DJQ4laLB'
41 1002 1891 837 Delta 17 'ZJlOvNXB'
42 1002 1891 837 Echo 7 'Rpr7a8Dy'
43 1002 1891 837 Echo 7 'zJjYNR4B'
44 1002 1891 837 Golf 22 'Vy9MqkRy'
45 1002 1891 837 Golf 22 'oye7Y0kJ'
46 1002 1891 837 Golf 22 '8BweZbnp'
47 1002 1891 837 Charlie 6 'Rpr7Z5by'
48 1002 1891 837 Charlie 6 'vBdVK1Ap'
49 1002 1891 837 Delta 17 '4Jkae8Dy'
50 1002 1891 837 Delta 17 'YpLM3nxy'
51 1002 1891 837 Echo 7 'VB7vD6Py'
52 1002 1891 837 Echo 7 'ZJlOXbzB'
53 1002 1891 837 Foxtrot 20 'MpNqezKJ'
54 1002 1891 837 Foxtrot 20 '9pOWo39p'
55 1002 1891 837 Golf 22 'MJXm5qnp'
56 1002 1891 837 Golf 22 'oy5vxd4B'
57 1002 1891 837 Golf 22 'DJQ4qz3B'