Search code examples
pythonpandasdataframesortingmerge

Merge Dataframe based on substring column labeld while keep the original columns label


I have a dataframe having columns with the a label pattern (name/startDateTime/endDateTime)

import pandas as pd
pd.DataFrame({
    "[RATE] BOJ presser/2024-03-19T07:30:00Z/2024-03-19T10:30:00Z": [1],
    "[RATE] BOJ/2024-01-23T04:00:00Z/2024-01-23T07:00:00Z": [2],
    "[RATE] BOJ/2024-03-19T04:00:00Z/2024-03-19T07:00:00Z": [3],
    "[RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T06:00:00Z": [4],
    "[RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T08:00:00Z": [5],
    "[RATE] BOJ/2024-06-14T03:00:00Z/2024-06-14T06:00:00Z": [6],
    "[RATE] BOJ/2024-06-14T03:00:00Z/2024-06-14T08:00:00Z": [7],
    "[RATE] BOJ/2024-07-31T03:00:00Z/2024-07-31T06:00:00Z": [8],
    "[RATE] BOJ/2024-07-31T03:00:00Z/2024-07-31T08:00:00Z": [9],
    "[RATE] BOJ/2024-09-20T03:00:00Z/2024-09-20T06:00:00Z": [10],
    "[RATE] BOJ/2024-09-20T03:00:00Z/2024-09-20T08:00:00Z": [11],
    "[RATE] BOJ/2024-10-31T04:00:00Z/2024-10-31T07:00:00Z": [12],
    "[RATE] BOJ/2024-10-31T04:00:00Z/2024-10-31T09:00:00Z": [13],
    "[RATE] BOJ/2024-12-19T04:00:00Z/2024-12-19T07:00:00Z": [14],
    "[RATE] BOJ/2024-12-19T04:00:00Z/2024-12-19T09:00:00Z": [15],
})

I would like to merge the columns (summing its values) having same name and start date (without the time), the column name should be the orignal one (First to be used)

This should give the following result

pd.DataFrame({
        "[RATE] BOJ presser/2024-03-19T07:30:00Z/2024-03-19T10:30:00Z": [1],
        "[RATE] BOJ/2024-01-23T04:00:00Z/2024-01-23T07:00:00Z": [2],
        "[RATE] BOJ/2024-03-19T04:00:00Z/2024-03-19T07:00:00Z": [3],
        "[RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T06:00:00Z": [9],
        "[RATE] BOJ/2024-06-14T03:00:00Z/2024-06-14T06:00:00Z": [13],
        "[RATE] BOJ/2024-07-31T03:00:00Z/2024-07-31T06:00:00Z": [17],
        "[RATE] BOJ/2024-09-20T03:00:00Z/2024-09-20T06:00:00Z": [21],
        ...
    })

In my example, every column has one raw, but in reality it has multiple based on datetime index


Solution

  • There are a couple of ways to do this, since you are operating along the columns you can do this in either a pure pandas or more typical Python approach. The whole idea is that you need to find the unique groups in your columns and perform aggregation within those groups.

    Pandas groupby(…)

    Pandas (version <= 2.1.0) supported grouped operations along the columns, however this functionality has been deprecated. So for a pure pandas approach you may need to transpose your DataFrame (which can be an expensive operation depending) and plan like so:

    import pandas as pd
    
    groupings = (
        df.columns.str.extract(r'([^/]+)/(\d{4}-\d{2}-\d{2})')
    )
    unique = groupings.assign(orig=df.columns).drop_duplicates([0, 1])
    
    result = (
        df.T
        .groupby([col.values for _, col in groupings.items()]).sum()
        .set_axis(unique['orig']).T
    )
    
    print(result.T) # transpose just for viewing output
    #                                                      0
    # orig
    # [RATE] BOJ presser/2024-03-19T07:30:00Z/2024-03...   2
    # [RATE] BOJ/2024-01-23T04:00:00Z/2024-01-23T07:0...   3
    # [RATE] BOJ/2024-03-19T04:00:00Z/2024-03-19T07:0...   9
    # [RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T06:0...  13
    # [RATE] BOJ/2024-06-14T03:00:00Z/2024-06-14T06:0...  17
    # [RATE] BOJ/2024-07-31T03:00:00Z/2024-07-31T06:0...  21
    # [RATE] BOJ/2024-09-20T03:00:00Z/2024-09-20T06:0...  25
    # [RATE] BOJ/2024-10-31T04:00:00Z/2024-10-31T07:0...  29
    # [RATE] BOJ/2024-12-19T04:00:00Z/2024-12-19T07:0...   1
    

    Pandas groupby(…, axis=1)

    If you have an older version of pandas you can use .groupby(…, axis=1) this should free you of a possibly expensive transpose.

    import pandas as pd
    
    groupings = (
        df.columns.str.extract(r'([^/]+)/(\d{4}-\d{2}-\d{2})')
    )
    unique = groupings.assign(orig=df.columns).drop_duplicates([0, 1])
    
    result = (
        df.groupby([col.values for _, col in groupings.items()], axis=1).sum()
        .set_axis(unique['orig'], axis=1)
    )
    
    print(result.T) # transpose just for viewing output
    #                                                      0
    # orig
    # [RATE] BOJ presser/2024-03-19T07:30:00Z/2024-03...   2
    # [RATE] BOJ/2024-01-23T04:00:00Z/2024-01-23T07:0...   3
    # [RATE] BOJ/2024-03-19T04:00:00Z/2024-03-19T07:0...   9
    # [RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T06:0...  13
    # [RATE] BOJ/2024-06-14T03:00:00Z/2024-06-14T06:0...  17
    # [RATE] BOJ/2024-07-31T03:00:00Z/2024-07-31T06:0...  21
    # [RATE] BOJ/2024-09-20T03:00:00Z/2024-09-20T06:0...  25
    # [RATE] BOJ/2024-10-31T04:00:00Z/2024-10-31T07:0...  29
    # [RATE] BOJ/2024-12-19T04:00:00Z/2024-12-19T07:0...   1
    

    More Python

    Instead of relying on explicit .groupby operations you can create your groups in Python. Considering these operations are performed along each column this should be quite performant as well and does not require a possibly expensive transpose or a deprecate API.

    The idea here is to use itertools to create the groupings, store the intermediate results in a dictionary and recreate a new DataFrame from those parts.

    import pandas as pd
    from itertools import groupby
    
    def extract_unique(column):
        splitted = column.split('/')
        return splitted[0], splitted[1][:10]
    
    result = {}
    for _, col_group in groupby(sorted(df.columns), key=extract_unique):
        first, *remaining = col_group
        result[first] = df[[first, *remaining]].sum(axis=1)
    
    result = pd.DataFrame(result)
    print(result.T)
    #                                                      0
    # [RATE] BOJ presser/2024-03-19T07:30:00Z/2024-03...   1
    # [RATE] BOJ/2024-01-23T04:00:00Z/2024-01-23T07:0...   2
    # [RATE] BOJ/2024-03-19T04:00:00Z/2024-03-19T07:0...   3
    # [RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T06:0...   9
    # [RATE] BOJ/2024-06-14T03:00:00Z/2024-06-14T06:0...  13
    # [RATE] BOJ/2024-07-31T03:00:00Z/2024-07-31T06:0...  17
    # [RATE] BOJ/2024-09-20T03:00:00Z/2024-09-20T06:0...  21
    # [RATE] BOJ/2024-10-31T04:00:00Z/2024-10-31T07:0...  25
    # [RATE] BOJ/2024-12-19T04:00:00Z/2024-12-19T07:0...  29