Search code examples
pythondataframe

How to Groupby and assign Series Values to each row?


I have the following data-frame (read from a csv file):

     my_df:
     my_date    my_id  values  key   factor
     1/1/2024   _One    123    key1   .56
     1/7/2024   _One    567    key1   .75
     1/14/2024  _One    100    key1   .81
     1/14/2024  _One    100    key2   .44
     1/1/2024   _Two    150    key3   .91
     1/7/2024   _Two    130    key3   .88
     1/1/2024   _Three  200    key4    0
     1/1/2024   _Three  200    key5    .45

So, there is an overlap of certain dates for two or more keys belonging to the same 'id'. What i want my data-frame to look like is as follows, that is, i need to calculate the allocated values based on the factor weights. Note: the calculated weight is obtained by dividing the factor by the sum of the factors in the overlapping periods. Say,

     my_df:
     my_date    my_id   values  key   factor    weights allocated_values
     1/1/2024   _One    123     key1    0.56    1       123
     1/7/2024   _One    500     key1    0.75    1       500
     1/14/2024  _One    100     key1    0.81    0.648   64.8
     1/14/2024  _One    100     key2    0.44    0.352   35.2
     1/1/2024   _Two    160     key3    0.91    1       160
     1/7/2024   _Two    130     key3    0.88    1       130
     1/1/2024   _Three  200     key4    0      0.50     100
     1/1/2024   _Three  200     key5    0.45   0.50     100
     

To achieve the above result, i am doing the following group by:

     for name, group in my_df.groupby('my_id'):
          for name1, group1 in group.groupby('key'):
              factors = group1['factor']
              weight = factors['factor']/factors.sum() if factors.sum() != 0 | (factors==0).any() else 1/len(factors)
              #what i tried- approach1
              group['weights'] = weight #doesn't work
              #what i tried next
              my_df['weights'] = my_df.update(group) #doesn't work
              

I am so tired now, unable to think any further. So posting it here for any help/guidance.

Would much appreciate any hints.


Solution

  • You could try to use .groupby (over my_id and my_date) in combination with .transform to get that done:

    df["allocated_values"] = (
        df["values"] * df["factor"]
            / df.groupby(["my_id", "my_date"])["factor"].transform("sum")
    )
    

    Result for the sample:

         my_date my_id  values   key  factor  allocated_values
    0   1/1/2024  _One     123  key1    0.56             123.0
    1   1/7/2024  _One     567  key1    0.75             567.0
    2  1/14/2024  _One     100  key1    0.81              64.8
    3  1/14/2024  _One     100  key2    0.44              35.2
    4   1/1/2024  _Two     150  key3    0.91             150.0
    5   1/7/2024  _Two     130  key3    0.88             130.0
    

    If you want to keep the weights then you could try:

    df = (
        df.assign(weights=
            df["factor"] / df.groupby(["my_id", "my_date"])["factor"].transform("sum"))
        .assign(allocated_values=lambda df: df["values"] * df["weights"])
    )
    

    Result for the sample:

         my_date my_id  values   key  factor  allocated_values  weights
    0   1/1/2024  _One     123  key1    0.56             123.0    1.000
    1   1/7/2024  _One     567  key1    0.75             567.0    1.000
    2  1/14/2024  _One     100  key1    0.81              64.8    0.648
    3  1/14/2024  _One     100  key2    0.44              35.2    0.352
    4   1/1/2024  _Two     150  key3    0.91             150.0    1.000
    5   1/7/2024  _Two     130  key3    0.88             130.0    1.000
    

    Regarding the modification: I'm not sure that I completly understand the specification, but you could try to modify the factors of those groups that contain a 0 by setting it to 1, so the resprective weights would be uniform (1 / length of group):

    df = (
        df.assign(mask=df["factor"].ne(0))
        .assign(mask=lambda df: df.groupby(["my_id", "my_date"])["mask"].transform("all"))
        .assign(factor_alt=lambda df: df["factor"].where(df["mask"], 1))
        .assign(weights=lambda df: df["factor_alt"]
            / df.groupby(["my_id", "my_date"])["factor_alt"].transform("sum"))
        .assign(allocated_values=lambda df: df["values"] * df["weights"])
        .drop(columns=["mask", "factor_alt"])
    )
    

    The first 2 steps identify the groups in which a 0 occurs, and the 3. step adjusts the factors accordingly. Then it's the same as before, just with the adjusted factors.

    I'm getting the following result with the modified input:

         my_date   my_id  values   key  factor  allocated_values  weights
    0   1/1/2024    _One     123  key1    0.56             123.0    1.000
    1   1/7/2024    _One     567  key1    0.75             567.0    1.000
    2  1/14/2024    _One     100  key1    0.81              64.8    0.648
    3  1/14/2024    _One     100  key2    0.44              35.2    0.352
    4   1/1/2024    _Two     150  key3    0.91             150.0    1.000
    5   1/7/2024    _Two     130  key3    0.88             130.0    1.000
    6   1/1/2024  _Three     200  key4    0.00             100.0    0.500
    7   1/1/2024  _Three     200  key5    0.45             100.0    0.500