Search code examples
pythonpandasgroupingconsolidation

Grouping data by multiple criteria in Python


I think I have a quick question, but I didn't find a way to google it in simple words.

I've got a raw dataset like this:

 Number of account     Value
      123               100
      456               300
      789               400
      910               100
      674               250

And I've got a methodological table to consolidate this raw data into something useful. It looks like:

 Variable              Number of account
    "a"                  123, 456, 910
    "b"                    789,674

So, in the end I would like to get a table like this:

 Variable              Number of account
    "a"                  Sum of values for(123, 456, 910)
    "b"                  Sum of values for(789,674)

My initial idea is to do something like: For each row in methodological table, For each Number of account in methodological table, Sum values in raw data.

Two questions:

  1. What is the best way to consolidate it?
  2. What if in methodological table number of accounts are comma-delimited strings? ("123,456,910"). Can I store multiple numbers in one cell in pandas DataFrame

Solution

  • Assuming I have data in two dataframes:

    df is :

    Number_of_account     Value
          123               100
          456               300
          789               400
          910               100
          674               250
    

    and table_2 is:

    Variable              Number_of_account
        "a"                  123,456,910
        "b"                    789,674
    

    First, I'll create a lookup table out of table2:

    lookup_table = pd.concat([pd.Series(row['Variable'], row['Number_of_account'].split(','))              
                             for _, row in table_2.iterrows()]).reset_index()
    lookup_table.columns = ["Number_of_account", "variable"]
    lookup_table.Number_of_account = pd.to_numeric(lookup_table.Number_of_account)
    

    The result is:

       Number_of_account variable
    0                123        a
    1                456        a
    2                910        a
    3                789        b
    4                674        b
    

    Then, I merge the main dataframe (df) with the lookup table, and use groupby to calculate the sum of the values.

    df = pd.merge(df, lookup_table, on="Number_of_account")
    df.groupby("variable")["Value"].sum()
    

    The result is:

    variable
    a    500
    b    650