Search code examples
pythonpandasdataframe

Add values of two Dataframes based on similar row values


EDITED FOR CLARITY

I am working on regional electricity system model currently I have two dataframes, one that has information of municipality and energy node, the other is the energy demand over time in a municipality, illustrated below (not the actual df that I use)

municip = {
    "muni_id": [1401, 1402, 1407, 1415, 1419, 1480, 1480, 1427, 1484],
    "muni_name": ["Har", "Par", "Ock", "Ste", "Tjo", "Gbg", "Gbg", "Sot", "Lys"],
    "new_muni_id": [1401, 1402, 1480, 1415, 1415, 1480, 1480, 1484, 1484],
    "new_muni_name": ["Har", "Har", "Gbg", "Ste", "Ste", "Gbg", "Gbg", "Lys", "Lys"],
    "new_node_id": ["HAR1", "PAR1", "GBG2", "STE1", "STE1", "GBG1", "GBG2", "LYS1", "LYS1"]
}

df_1 = pd.DataFrame(municip)

demand = {
    "period": [1, 2, 3, 4, 5],
    1401: [2, 4, 4, 1, 2],
    1402: [1, 1, 3, 3, 5],
    1407: [2, 4, 4, 1, 2],
    1415: [1, 1, 3, 3, 5],
    1419: [1, 1, 3, 3, 5],
    1480: [1, 1, 3, 3, 5],
    1427: [2, 4, 4, 1, 2],
    1484: [1, 2, 3, 4, 5]
}

df_2 = pd.DataFrame(demand)

there are "old" and "new" muni_id and name because some municipality may not have node, therefore I aggregated the demand to the nearest node. df_2 is demand in the whole municipality, with the number in columns indicate muni_id.

my intention is to assign the municipal demand into energy node in a uniform distribution manner (as a start for simplicity)

I tried the following to achieve:

  1. create a new dataframe with columns of period and list of nodes

  2. assign demand on a municipality that has same old and new id/name, divided with the number of nodes within that same municipality, for example above Gbg has 2 nodes (GBG1 and GBG2), therefore the demand on GBG1 and GBG2 will be demand of 1480 (Gbg) divided by two.

  3. for municipality that have different old and new id/name, the demand of old muni id will be added to the demand of the assigned node based on the previous step. for example in df_1 muni_id 1407 (Ock) is assigned to GBG2 node, therefore the demand of GBG2 in the first period will be 2.5, where 0.5 is from the previous step (Gbg demand / 2), 2 from the current step (demand of Ock)

import pandas as pd

def profiling(df_1, df_2):
    
    # create empty df
    nodes = df_1["new_node_id"].unique()
    node_df = pd.DataFrame(columns=["period"] + list(nodes))
    node_df["period"] = df_2["period"]

    # loop based on node id
    for node in nodes:
        node_info = df_1[df_1["new_node_id"] == node]
        muni_id = node_info["muni_id"].values[0]
        new_muni_id = node_info["new_muni_id"].values[0]

        # assign demand to all nodes that has same old/new id
        if muni_id == new_muni_id :
            cumul_node = len(df_1[df_1["muni_id"] == muni_id])
            node_df[node] = df_2[muni_id] / cumul_node 
        
        # assign demand to node for muni that has different old/new id
        else:
            cumul_node = len(df_1[df_1["new_muni_id"] == new_muni_id])         
            add_demand = df_2[muni_id] / cumul_node

            if node in node_df:
                node_df[node] += add_demand
            else:
                node_df[node] = add_demand

    return node_df

However, this results in some nodes return a NaN values, specifically municipalities that have different old/new id, like below table

period HAR1 PAR1 GBG2 STE1 GBG1 LYS1
1 2 1 NaN 1 0.5 NaN
2 4 1 NaN 1 0.5 NaN
3 4 3 NaN 3 1.5 NaN
4 1 3 NaN 3 1.5 NaN
5 2 5 NaN 5 2.5 NaN

from my logic, hourly demand of GBG2 should be half of 1480 (Gbg muni) + 1407 (Ock muni), since Ock is assigned GBG2 as new node, i.e. 2.5 in the first period. The same goes for LYS1 node, where it will be the total demand of 1427 (Sot) and 1484 (Lys).

So I expect to get something like this

period HAR1 PAR1 GBG2 STE1 GBG1 LYS1
1 2 1 2.5 1 0.5 3
2 4 1 4.5 1 0.5 6
3 4 3 5.5 3 1.5 7
4 1 3 2.5 3 1.5 5
5 2 5 4.5 5 2.5 7

Could someone improve my function, or perhaps there is a flaw in my logic of adding the values?

Sorry for the lengthy question and thanks for the help in advance!


Solution

  • I got lost trying to solve it from the piece of code you provided with for loops, so here is my approach :

    def profiling(df_1, df_2):
    
        df_1['num_nodes'] = df_1.groupby('muni_id')['new_node_id'].transform('count')
    
        df_2_melted = df_2.melt(id_vars=['period'], var_name='muni_id', value_name='demand')
    
        merged_df = df_2_melted.merge(df_1, on='muni_id', how='left')
    
        merged_df['demand'] = merged_df['demand'] / merged_df['num_nodes']
    
        final_df = merged_df.pivot_table(index='period', columns='new_node_id', values='demand', aggfunc='sum').reset_index()
    
        final_df = final_df.rename_axis(None, axis=1)
    
        return final_df
    

    If we break down the code :

    1st step

    df_1['num_nodes'] = df_1.groupby('muni_id')['new_node_id'].transform('count')
    

    This counts the number of nodes in each group of muni_id nodes and return a Series which is assigned to a num_nodes column.

    2nd step

    df_2_melted = df_2.melt(id_vars=['period'], var_name='muni_id', value_name='demand')
    

    The use of melt creates a new DataFrame where a column named muni_id is added to switch from a n_period x n_muni table to a single column of values for demand on the period for a given node. Meaning that each row corresponds to a specific period and muni_id pair with its associated demand.

    3rd step

    merged_df = df_2_melted.merge(df_1, on='muni_id', how='left')
    

    This merges df_2_melted with df_1 using the muni_id column. This adds all the information from df_1 into df_2_melted for each period, muni_id and demand.

    4th step

    merged_df['demand'] = merged_df['demand'] / merged_df['num_nodes']
    

    This divides the demand by the number of nodes found at step 1.

    5th step

    final_df = merged_df.pivot_table(index='period', columns='new_node_id', values='demand', aggfunc='sum').reset_index()
    

    This creates a pivot table using period as rows and node_id as columns. The table is filled with values from the demand column and to ensure that old and new municipalities ids are handled properly, the aggfunc='sum' is used. .reset_index() allows to turn the index into the period column.

    6th step

    final_df = final_df.rename_axis(None, axis=1)
    

    Drops the new_node_id name for the index since it is not this anymore.

    Verification

    With the example data you gave, i get :

    municip = {
        "muni_id": [1401, 1402, 1407, 1415, 1419, 1480, 1480, 1427, 1484],
        "muni_name": ["Har", "Par", "Ock", "Ste", "Tjo", "Gbg", "Gbg", "Sot", "Lys"],
        "new_muni_id": [1401, 1402, 1480, 1415, 1415, 1480, 1480, 1484, 1484],
        "new_muni_name": ["Har", "Har", "Gbg", "Ste", "Ste", "Gbg", "Gbg", "Lys", "Lys"],
        "new_node_id": ["HAR1", "PAR1", "GBG2", "STE1", "STE1", "GBG1", "GBG2", "LYS1", "LYS1"]
    }
    df_1 = pd.DataFrame(municip)
    
    demand = {
        "period": [1, 2, 3, 4, 5],
        1401: [2, 4, 4, 1, 2],
        1402: [1, 1, 3, 3, 5],
        1407: [2, 4, 4, 1, 2],
        1415: [1, 1, 3, 3, 5],
        1419: [1, 1, 3, 3, 5],
        1480: [1, 1, 3, 3, 5],
        1427: [2, 4, 4, 1, 2],
        1484: [1, 2, 3, 4, 5]
    }
    df_2 = pd.DataFrame(demand)
    
    
    profiling(df_1, df_2)
    
    >>>
       period  GBG1  GBG2  HAR1  LYS1  PAR1  STE1
    0       1   0.5   2.5   2.0   3.0   1.0   2.0
    1       2   0.5   4.5   4.0   6.0   1.0   2.0
    2       3   1.5   5.5   4.0   7.0   3.0   6.0
    3       4   1.5   2.5   1.0   5.0   3.0   6.0
    4       5   2.5   4.5   2.0   7.0   5.0  10.0