I have a DataFrame whose columns follow the label pattern (name/startDateTime/endDateTime)
import pandas as pd
df = pd.DataFrame({
"[RATE] BOJ presser/2024-03-19T07:30:00Z/2024-03-19T10:30:00Z": [1],
"[RATE] BOJ/2024-01-23T04:00:00Z/2024-01-23T07:00:00Z": [2],
"[RATE] BOJ/2024-03-19T04:00:00Z/2024-03-19T07:00:00Z": [3],
"[RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T06:00:00Z": [4],
"[RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T08:00:00Z": [5],
"[RATE] BOJ/2024-06-14T03:00:00Z/2024-06-14T06:00:00Z": [6],
"[RATE] BOJ/2024-06-14T03:00:00Z/2024-06-14T08:00:00Z": [7],
"[RATE] BOJ/2024-07-31T03:00:00Z/2024-07-31T06:00:00Z": [8],
"[RATE] BOJ/2024-07-31T03:00:00Z/2024-07-31T08:00:00Z": [9],
"[RATE] BOJ/2024-09-20T03:00:00Z/2024-09-20T06:00:00Z": [10],
"[RATE] BOJ/2024-09-20T03:00:00Z/2024-09-20T08:00:00Z": [11],
"[RATE] BOJ/2024-10-31T04:00:00Z/2024-10-31T07:00:00Z": [12],
"[RATE] BOJ/2024-10-31T04:00:00Z/2024-10-31T09:00:00Z": [13],
"[RATE] BOJ/2024-12-19T04:00:00Z/2024-12-19T07:00:00Z": [14],
"[RATE] BOJ/2024-12-19T04:00:00Z/2024-12-19T09:00:00Z": [15],
})
I would like to merge the columns (summing their values) that share the same name and start date (ignoring the time); the merged column should keep the original name (the first one to appear).
This should give the following result:
pd.DataFrame({
"[RATE] BOJ presser/2024-03-19T07:30:00Z/2024-03-19T10:30:00Z": [1],
"[RATE] BOJ/2024-01-23T04:00:00Z/2024-01-23T07:00:00Z": [2],
"[RATE] BOJ/2024-03-19T04:00:00Z/2024-03-19T07:00:00Z": [3],
"[RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T06:00:00Z": [9],
"[RATE] BOJ/2024-06-14T03:00:00Z/2024-06-14T06:00:00Z": [13],
"[RATE] BOJ/2024-07-31T03:00:00Z/2024-07-31T06:00:00Z": [17],
"[RATE] BOJ/2024-09-20T03:00:00Z/2024-09-20T06:00:00Z": [21],
...
})
In my example every column has a single row, but in reality each column has multiple rows on a datetime index.
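For illustration, with a datetime index two of the columns might look like this (values made up):
pd.DataFrame(
    {
        "[RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T06:00:00Z": [4, 40],
        "[RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T08:00:00Z": [5, 50],
    },
    index=pd.to_datetime(["2024-04-26 03:00:00", "2024-04-26 04:00:00"]),
)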
There are a couple of ways to do this. Since you are operating along the columns, you can take either a pure pandas approach or a more typical Python approach. The whole idea is that you need to find the unique groups among your columns and perform the aggregation within those groups.
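To make the grouping key concrete: each label reduces to a (name, start date) pair, which you can see with plain string splitting (the helper name here is mine; the pure-Python approach further down builds on exactly this idea):
def group_key(label):
    # "name/startDateTime/endDateTime" -> (name, start date without the time)
    name, start, _end = label.split('/')
    return name, start[:10]

print(group_key("[RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T06:00:00Z"))
# ('[RATE] BOJ', '2024-04-26')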
pandas (up to version 2.1.0) supported grouped operations along the columns, but this functionality has been deprecated. So for a pure pandas approach you may need to transpose your DataFrame (which can be an expensive operation, depending on its size) and proceed like so:
import pandas as pd
# extract (name, start date) from each column label
groupings = (
    df.columns.str.extract(r'([^/]+)/(\d{4}-\d{2}-\d{2})')
)

# first original label seen for each (name, start date) group
unique = groupings.assign(orig=df.columns).drop_duplicates([0, 1])

result = (
    df.T
    # sort=False keeps groups in first-seen order, aligned with `unique`
    .groupby([col.values for _, col in groupings.items()], sort=False).sum()
    .set_axis(unique['orig']).T
)
print(result.T) # transpose just for viewing output
#                                                      0
# orig
# [RATE] BOJ presser/2024-03-19T07:30:00Z/2024-03...   1
# [RATE] BOJ/2024-01-23T04:00:00Z/2024-01-23T07:0...   2
# [RATE] BOJ/2024-03-19T04:00:00Z/2024-03-19T07:0...   3
# [RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T06:0...   9
# [RATE] BOJ/2024-06-14T03:00:00Z/2024-06-14T06:0...  13
# [RATE] BOJ/2024-07-31T03:00:00Z/2024-07-31T06:0...  17
# [RATE] BOJ/2024-09-20T03:00:00Z/2024-09-20T06:0...  21
# [RATE] BOJ/2024-10-31T04:00:00Z/2024-10-31T07:0...  25
# [RATE] BOJ/2024-12-19T04:00:00Z/2024-12-19T07:0...  29
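As a quick sanity check (not part of the original example), one of the merged groups should now carry the sum of its members under the first label:
print(result["[RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T06:00:00Z"])
# 0    9
# Name: [RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T06:00:00Z, dtype: int64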
If you have an older version of pandas you can use .groupby(…, axis=1), which frees you from the possibly expensive transpose:
import pandas as pd
# extract (name, start date) from each column label
groupings = (
    df.columns.str.extract(r'([^/]+)/(\d{4}-\d{2}-\d{2})')
)
unique = groupings.assign(orig=df.columns).drop_duplicates([0, 1])
result = (
    # sort=False again keeps the group order aligned with `unique`
    df.groupby([col.values for _, col in groupings.items()], axis=1, sort=False).sum()
    .set_axis(unique['orig'], axis=1)
)
print(result.T) # transpose just for viewing output
#                                                      0
# orig
# [RATE] BOJ presser/2024-03-19T07:30:00Z/2024-03...   1
# [RATE] BOJ/2024-01-23T04:00:00Z/2024-01-23T07:0...   2
# [RATE] BOJ/2024-03-19T04:00:00Z/2024-03-19T07:0...   3
# [RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T06:0...   9
# [RATE] BOJ/2024-06-14T03:00:00Z/2024-06-14T06:0...  13
# [RATE] BOJ/2024-07-31T03:00:00Z/2024-07-31T06:0...  17
# [RATE] BOJ/2024-09-20T03:00:00Z/2024-09-20T06:0...  21
# [RATE] BOJ/2024-10-31T04:00:00Z/2024-10-31T07:0...  25
# [RATE] BOJ/2024-12-19T04:00:00Z/2024-12-19T07:0...  29
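If the same code has to run on both old and new pandas, a small version guard can dispatch between the two approaches. This is only a sketch (the function name is mine), assuming axis=1 support disappears in pandas 3.0 as the deprecation announced:
def regroup_columns(df, groupings, unique):
    keys = [col.values for _, col in groupings.items()]
    if int(pd.__version__.split('.')[0]) >= 3:
        # axis=1 grouping is gone: go through a transpose instead
        out = df.T.groupby(keys, sort=False).sum()
        return out.set_axis(unique['orig']).T
    # older pandas: grouped column operations still work (deprecated since 2.1)
    return (
        df.groupby(keys, axis=1, sort=False).sum()
        .set_axis(unique['orig'], axis=1)
    )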
Instead of relying on explicit .groupby operations you can create your groups in Python. Since these operations are performed along each column, this should be quite performant as well, and it requires neither a possibly expensive transpose nor a deprecated API.
The idea here is to use itertools to create the groupings, store the intermediate results in a dictionary, and rebuild a new DataFrame from those parts.
import pandas as pd
from itertools import groupby
def extract_unique(column):
    # grouping key: (name, start date without the time)
    name, start, _end = column.split('/')
    return name, start[:10]

result = {}
# sorting makes columns with equal keys adjacent, as itertools.groupby requires
for _, col_group in groupby(sorted(df.columns), key=extract_unique):
    # keep the first label of the group and sum the matching columns row-wise
    first, *remaining = col_group
    result[first] = df[[first, *remaining]].sum(axis=1)
result = pd.DataFrame(result)
print(result.T)
# 0
# [RATE] BOJ presser/2024-03-19T07:30:00Z/2024-03... 1
# [RATE] BOJ/2024-01-23T04:00:00Z/2024-01-23T07:0... 2
# [RATE] BOJ/2024-03-19T04:00:00Z/2024-03-19T07:0... 3
# [RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T06:0... 9
# [RATE] BOJ/2024-06-14T03:00:00Z/2024-06-14T06:0... 13
# [RATE] BOJ/2024-07-31T03:00:00Z/2024-07-31T06:0... 17
# [RATE] BOJ/2024-09-20T03:00:00Z/2024-09-20T06:0... 21
# [RATE] BOJ/2024-10-31T04:00:00Z/2024-10-31T07:0... 25
# [RATE] BOJ/2024-12-19T04:00:00Z/2024-12-19T07:0... 29
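Whichever route you take, the approaches should agree on this data (both keep the same first label per group). If you hold the two results under separate names, say result_pandas and result_python (names mine), you can assert it:
pd.testing.assert_frame_equal(
    result_pandas.sort_index(axis=1),
    result_python.sort_index(axis=1),
    check_names=False,  # the pandas version labels its column axis 'orig'
)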