EDITED FOR CLARITY
I am working on regional electricity system model currently I have two dataframes, one that has information of municipality and energy node, the other is the energy demand over time in a municipality, illustrated below (not the actual df that I use)
municip = {
"muni_id": [1401, 1402, 1407, 1415, 1419, 1480, 1480, 1427, 1484],
"muni_name": ["Har", "Par", "Ock", "Ste", "Tjo", "Gbg", "Gbg", "Sot", "Lys"],
"new_muni_id": [1401, 1402, 1480, 1415, 1415, 1480, 1480, 1484, 1484],
"new_muni_name": ["Har", "Har", "Gbg", "Ste", "Ste", "Gbg", "Gbg", "Lys", "Lys"],
"new_node_id": ["HAR1", "PAR1", "GBG2", "STE1", "STE1", "GBG1", "GBG2", "LYS1", "LYS1"]
}
df_1 = pd.DataFrame(municip)
demand = {
"period": [1, 2, 3, 4, 5],
1401: [2, 4, 4, 1, 2],
1402: [1, 1, 3, 3, 5],
1407: [2, 4, 4, 1, 2],
1415: [1, 1, 3, 3, 5],
1419: [1, 1, 3, 3, 5],
1480: [1, 1, 3, 3, 5],
1427: [2, 4, 4, 1, 2],
1484: [1, 2, 3, 4, 5]
}
df_2 = pd.DataFrame(demand)
there are "old" and "new" muni_id and name because some municipality may not have node, therefore I aggregated the demand to the nearest node. df_2 is demand in the whole municipality, with the number in columns indicate muni_id.
my intention is to assign the municipal demand into energy node in a uniform distribution manner (as a start for simplicity)
I tried the following to achieve:
create a new dataframe with columns of period and list of nodes
assign demand on a municipality that has same old and new id/name, divided with the number of nodes within that same municipality, for example above Gbg has 2 nodes (GBG1 and GBG2), therefore the demand on GBG1 and GBG2 will be demand of 1480 (Gbg) divided by two.
for municipality that have different old and new id/name, the demand of old muni id will be added to the demand of the assigned node based on the previous step. for example in df_1 muni_id 1407 (Ock) is assigned to GBG2 node, therefore the demand of GBG2 in the first period will be 2.5, where 0.5 is from the previous step (Gbg demand / 2), 2 from the current step (demand of Ock)
import pandas as pd
def profiling(df_1, df_2):
# create empty df
nodes = df_1["new_node_id"].unique()
node_df = pd.DataFrame(columns=["period"] + list(nodes))
node_df["period"] = df_2["period"]
# loop based on node id
for node in nodes:
node_info = df_1[df_1["new_node_id"] == node]
muni_id = node_info["muni_id"].values[0]
new_muni_id = node_info["new_muni_id"].values[0]
# assign demand to all nodes that has same old/new id
if muni_id == new_muni_id :
cumul_node = len(df_1[df_1["muni_id"] == muni_id])
node_df[node] = df_2[muni_id] / cumul_node
# assign demand to node for muni that has different old/new id
else:
cumul_node = len(df_1[df_1["new_muni_id"] == new_muni_id])
add_demand = df_2[muni_id] / cumul_node
if node in node_df:
node_df[node] += add_demand
else:
node_df[node] = add_demand
return node_df
However, this results in some nodes return a NaN values, specifically municipalities that have different old/new id, like below table
period | HAR1 | PAR1 | GBG2 | STE1 | GBG1 | LYS1 |
---|---|---|---|---|---|---|
1 | 2 | 1 | NaN | 1 | 0.5 | NaN |
2 | 4 | 1 | NaN | 1 | 0.5 | NaN |
3 | 4 | 3 | NaN | 3 | 1.5 | NaN |
4 | 1 | 3 | NaN | 3 | 1.5 | NaN |
5 | 2 | 5 | NaN | 5 | 2.5 | NaN |
from my logic, hourly demand of GBG2 should be half of 1480 (Gbg muni) + 1407 (Ock muni), since Ock is assigned GBG2 as new node, i.e. 2.5 in the first period. The same goes for LYS1 node, where it will be the total demand of 1427 (Sot) and 1484 (Lys).
So I expect to get something like this
period | HAR1 | PAR1 | GBG2 | STE1 | GBG1 | LYS1 |
---|---|---|---|---|---|---|
1 | 2 | 1 | 2.5 | 1 | 0.5 | 3 |
2 | 4 | 1 | 4.5 | 1 | 0.5 | 6 |
3 | 4 | 3 | 5.5 | 3 | 1.5 | 7 |
4 | 1 | 3 | 2.5 | 3 | 1.5 | 5 |
5 | 2 | 5 | 4.5 | 5 | 2.5 | 7 |
Could someone improve my function, or perhaps there is a flaw in my logic of adding the values?
Sorry for the lengthy question and thanks for the help in advance!
I got lost trying to solve it from the piece of code you provided with for loops, so here is my approach :
def profiling(df_1, df_2):
df_1['num_nodes'] = df_1.groupby('muni_id')['new_node_id'].transform('count')
df_2_melted = df_2.melt(id_vars=['period'], var_name='muni_id', value_name='demand')
merged_df = df_2_melted.merge(df_1, on='muni_id', how='left')
merged_df['demand'] = merged_df['demand'] / merged_df['num_nodes']
final_df = merged_df.pivot_table(index='period', columns='new_node_id', values='demand', aggfunc='sum').reset_index()
final_df = final_df.rename_axis(None, axis=1)
return final_df
If we break down the code :
df_1['num_nodes'] = df_1.groupby('muni_id')['new_node_id'].transform('count')
This counts the number of nodes in each group of muni_id
nodes and return a Series
which is assigned to a num_nodes
column.
df_2_melted = df_2.melt(id_vars=['period'], var_name='muni_id', value_name='demand')
The use of melt
creates a new DataFrame
where a column named muni_id
is added to switch from a n_period x n_muni
table to a single column of values for demand on the period for a given node. Meaning that each row corresponds to a specific period
and muni_id
pair with its associated demand.
merged_df = df_2_melted.merge(df_1, on='muni_id', how='left')
This merges df_2_melted
with df_1
using the muni_id
column. This adds all the information from df_1
into df_2_melted
for each period
, muni_id
and demand
.
merged_df['demand'] = merged_df['demand'] / merged_df['num_nodes']
This divides the demand by the number of nodes found at step 1.
final_df = merged_df.pivot_table(index='period', columns='new_node_id', values='demand', aggfunc='sum').reset_index()
This creates a pivot table using period
as rows and node_id
as columns. The table is filled with values from the demand
column and to ensure that old and new municipalities ids are handled properly, the aggfunc='sum'
is used. .reset_index()
allows to turn the index into the period
column.
final_df = final_df.rename_axis(None, axis=1)
Drops the new_node_id
name for the index since it is not this anymore.
With the example data you gave, i get :
municip = {
"muni_id": [1401, 1402, 1407, 1415, 1419, 1480, 1480, 1427, 1484],
"muni_name": ["Har", "Par", "Ock", "Ste", "Tjo", "Gbg", "Gbg", "Sot", "Lys"],
"new_muni_id": [1401, 1402, 1480, 1415, 1415, 1480, 1480, 1484, 1484],
"new_muni_name": ["Har", "Har", "Gbg", "Ste", "Ste", "Gbg", "Gbg", "Lys", "Lys"],
"new_node_id": ["HAR1", "PAR1", "GBG2", "STE1", "STE1", "GBG1", "GBG2", "LYS1", "LYS1"]
}
df_1 = pd.DataFrame(municip)
demand = {
"period": [1, 2, 3, 4, 5],
1401: [2, 4, 4, 1, 2],
1402: [1, 1, 3, 3, 5],
1407: [2, 4, 4, 1, 2],
1415: [1, 1, 3, 3, 5],
1419: [1, 1, 3, 3, 5],
1480: [1, 1, 3, 3, 5],
1427: [2, 4, 4, 1, 2],
1484: [1, 2, 3, 4, 5]
}
df_2 = pd.DataFrame(demand)
profiling(df_1, df_2)
>>>
period GBG1 GBG2 HAR1 LYS1 PAR1 STE1
0 1 0.5 2.5 2.0 3.0 1.0 2.0
1 2 0.5 4.5 4.0 6.0 1.0 2.0
2 3 1.5 5.5 4.0 7.0 3.0 6.0
3 4 1.5 2.5 1.0 5.0 3.0 6.0
4 5 2.5 4.5 2.0 7.0 5.0 10.0