I think I have a quick question, but I didn't find a way to google it in simple words.
I've got a raw dataset like this:
Number of account Value
123 100
456 300
789 400
910 100
674 250
And I've got a methodological table to consolidate this raw data into something useful. It looks like:
Variable Number of account
"a" 123, 456, 910
"b" 789,674
So, in the end I would like to get a table like this:
Variable Number of account
"a" Sum of values for(123, 456, 910)
"b" Sum of values for(789,674)
My initial idea is to do something like: For each row in methodological table, For each Number of account in methodological table, Sum values in raw data.
Two questions:
Assuming I have data in two dataframes:
df
is :
Number_of_account Value
123 100
456 300
789 400
910 100
674 250
and table_2
is:
Variable Number_of_account
"a" 123,456,910
"b" 789,674
First, I'll create a lookup table out of table2:
lookup_table = pd.concat([pd.Series(row['Variable'], row['Number_of_account'].split(','))
for _, row in table_2.iterrows()]).reset_index()
lookup_table.columns = ["Number_of_account", "variable"]
lookup_table.Number_of_account = pd.to_numeric(lookup_table.Number_of_account)
The result is:
Number_of_account variable
0 123 a
1 456 a
2 910 a
3 789 b
4 674 b
Then, I merge the main dataframe (df
) with the lookup table, and use groupby
to calculate the sum of the values.
df = pd.merge(df, lookup_table, on="Number_of_account")
df.groupby("variable")["Value"].sum()
The result is:
variable
a 500
b 650