Merge Dataframe based on substring column labeld while keep the original columns label

I have a dataframe having columns with the a label pattern (name/startDateTime/endDateTime)

import pandas as pd
pd.DataFrame({
    "[RATE] BOJ presser/2024-03-19T07:30:00Z/2024-03-19T10:30:00Z": [1],
    "[RATE] BOJ/2024-01-23T04:00:00Z/2024-01-23T07:00:00Z": [2],
    "[RATE] BOJ/2024-03-19T04:00:00Z/2024-03-19T07:00:00Z": [3],
    "[RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T06:00:00Z": [4],
    "[RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T08:00:00Z": [5],
    "[RATE] BOJ/2024-06-14T03:00:00Z/2024-06-14T06:00:00Z": [6],
    "[RATE] BOJ/2024-06-14T03:00:00Z/2024-06-14T08:00:00Z": [7],
    "[RATE] BOJ/2024-07-31T03:00:00Z/2024-07-31T06:00:00Z": [8],
    "[RATE] BOJ/2024-07-31T03:00:00Z/2024-07-31T08:00:00Z": [9],
    "[RATE] BOJ/2024-09-20T03:00:00Z/2024-09-20T06:00:00Z": [10],
    "[RATE] BOJ/2024-09-20T03:00:00Z/2024-09-20T08:00:00Z": [11],
    "[RATE] BOJ/2024-10-31T04:00:00Z/2024-10-31T07:00:00Z": [12],
    "[RATE] BOJ/2024-10-31T04:00:00Z/2024-10-31T09:00:00Z": [13],
    "[RATE] BOJ/2024-12-19T04:00:00Z/2024-12-19T07:00:00Z": [14],
    "[RATE] BOJ/2024-12-19T04:00:00Z/2024-12-19T09:00:00Z": [15],
})

I would like to merge the columns (summing its values) having same name and start date (without the time), the column name should be the orignal one (First to be used)

This should give the following result

pd.DataFrame({
        "[RATE] BOJ presser/2024-03-19T07:30:00Z/2024-03-19T10:30:00Z": [1],
        "[RATE] BOJ/2024-01-23T04:00:00Z/2024-01-23T07:00:00Z": [2],
        "[RATE] BOJ/2024-03-19T04:00:00Z/2024-03-19T07:00:00Z": [3],
        "[RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T06:00:00Z": [9],
        "[RATE] BOJ/2024-06-14T03:00:00Z/2024-06-14T06:00:00Z": [13],
        "[RATE] BOJ/2024-07-31T03:00:00Z/2024-07-31T06:00:00Z": [17],
        "[RATE] BOJ/2024-09-20T03:00:00Z/2024-09-20T06:00:00Z": [21],
        ...
    })

In my example, every column has one raw, but in reality it has multiple based on datetime index

Solution

There are a couple of ways to do this, since you are operating along the columns you can do this in either a pure pandas or more typical Python approach. The whole idea is that you need to find the unique groups in your columns and perform aggregation within those groups.

Pandas groupby(…)

Pandas (version <= 2.1.0) supported grouped operations along the columns, however this functionality has been deprecated. So for a pure pandas approach you may need to transpose your DataFrame (which can be an expensive operation depending) and plan like so:

import pandas as pd

groupings = (
    df.columns.str.extract(r'([^/]+)/(\d{4}-\d{2}-\d{2})')
)
unique = groupings.assign(orig=df.columns).drop_duplicates([0, 1])

result = (
    df.T
    .groupby([col.values for _, col in groupings.items()]).sum()
    .set_axis(unique['orig']).T
)

print(result.T) # transpose just for viewing output
#                                                      0
# orig
# [RATE] BOJ presser/2024-03-19T07:30:00Z/2024-03...   2
# [RATE] BOJ/2024-01-23T04:00:00Z/2024-01-23T07:0...   3
# [RATE] BOJ/2024-03-19T04:00:00Z/2024-03-19T07:0...   9
# [RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T06:0...  13
# [RATE] BOJ/2024-06-14T03:00:00Z/2024-06-14T06:0...  17
# [RATE] BOJ/2024-07-31T03:00:00Z/2024-07-31T06:0...  21
# [RATE] BOJ/2024-09-20T03:00:00Z/2024-09-20T06:0...  25
# [RATE] BOJ/2024-10-31T04:00:00Z/2024-10-31T07:0...  29
# [RATE] BOJ/2024-12-19T04:00:00Z/2024-12-19T07:0...   1

Pandas groupby(…, axis=1)

If you have an older version of pandas you can use .groupby(…, axis=1) this should free you of a possibly expensive transpose.

import pandas as pd

groupings = (
    df.columns.str.extract(r'([^/]+)/(\d{4}-\d{2}-\d{2})')
)
unique = groupings.assign(orig=df.columns).drop_duplicates([0, 1])

result = (
    df.groupby([col.values for _, col in groupings.items()], axis=1).sum()
    .set_axis(unique['orig'], axis=1)
)

print(result.T) # transpose just for viewing output
#                                                      0
# orig
# [RATE] BOJ presser/2024-03-19T07:30:00Z/2024-03...   2
# [RATE] BOJ/2024-01-23T04:00:00Z/2024-01-23T07:0...   3
# [RATE] BOJ/2024-03-19T04:00:00Z/2024-03-19T07:0...   9
# [RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T06:0...  13
# [RATE] BOJ/2024-06-14T03:00:00Z/2024-06-14T06:0...  17
# [RATE] BOJ/2024-07-31T03:00:00Z/2024-07-31T06:0...  21
# [RATE] BOJ/2024-09-20T03:00:00Z/2024-09-20T06:0...  25
# [RATE] BOJ/2024-10-31T04:00:00Z/2024-10-31T07:0...  29
# [RATE] BOJ/2024-12-19T04:00:00Z/2024-12-19T07:0...   1

More Python

Instead of relying on explicit .groupby operations you can create your groups in Python. Considering these operations are performed along each column this should be quite performant as well and does not require a possibly expensive transpose or a deprecate API.

The idea here is to use itertools to create the groupings, store the intermediate results in a dictionary and recreate a new DataFrame from those parts.

import pandas as pd
from itertools import groupby

def extract_unique(column):
    splitted = column.split('/')
    return splitted[0], splitted[1][:10]

result = {}
for _, col_group in groupby(sorted(df.columns), key=extract_unique):
    first, *remaining = col_group
    result[first] = df[[first, *remaining]].sum(axis=1)

result = pd.DataFrame(result)
print(result.T)
#                                                      0
# [RATE] BOJ presser/2024-03-19T07:30:00Z/2024-03...   1
# [RATE] BOJ/2024-01-23T04:00:00Z/2024-01-23T07:0...   2
# [RATE] BOJ/2024-03-19T04:00:00Z/2024-03-19T07:0...   3
# [RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T06:0...   9
# [RATE] BOJ/2024-06-14T03:00:00Z/2024-06-14T06:0...  13
# [RATE] BOJ/2024-07-31T03:00:00Z/2024-07-31T06:0...  17
# [RATE] BOJ/2024-09-20T03:00:00Z/2024-09-20T06:0...  21
# [RATE] BOJ/2024-10-31T04:00:00Z/2024-10-31T07:0...  25
# [RATE] BOJ/2024-12-19T04:00:00Z/2024-12-19T07:0...  29