Search code examples
pythonpandasdataframedata-processing

Distribute a row over other rows sharing same key


I have a dataframe that looks like this :

+------+------------+-------+--------------+
| name |    date    | value | replacement |
+------+------------+-------+--------------+
| A    | 20/11/2016 |    10 | NaN          |
| C    | 20/11/2016 |    8  | [A,B]        |
| B    | 20/11/2016 |    12 | NaN          |
| E    | 25/12/2016 |    16 | NaN          |
| F    | 25/12/2016 |    18 | NaN          |
| D    | 25/12/2016 |    11 | [E,F]        |
+------+------------+-------+--------------+

What I would like to do :
For each row that has a list of names in the column 'replacement', I would like its 'value' to be distributed equally over the rows containing those replacements + for the same date.
For the previous example, the output would look like this :

+------+------------+-------+------------------+
| name |    date    | value | additional value |
+------+------------+-------+------------------+
| A    | 20/11/2016 |    10 |                4 |
| B    | 20/11/2016 |    12 |                4 |
| A    | 25/12/2016 |    16 |              5.5 |
| B    | 25/12/2016 |    18 |              5.5 |
+------+------------+-------+------------------+

I managed to find a way to perform the distribution directly without creating a new column by splitting those rows and grouping by name + date but 1/ It was too slow + 2/I do need to create that additional column and can't find a way to do so.


Solution

  • Idea is create new column by lenghts of replacement lists with Series.str.len and then DataFrame.explode (pandas 0.25+) them to scalars. Divide columns value by new and merge by original with another columns names for add original columns:

    df1 = df.assign(new=df['replacement'].str.len()).explode('replacement')
    df1['new'] = df1['value'].div(df1['new'])
    
    df1 = df1[['name','date','value']].merge(df1[['replacement','date','new']],
                                        left_on=['name','date'],
                                        right_on=['replacement','date'])
    df1['replacement'] = df1.pop('new')
    print (df1)
      name        date  value  replacement
    0    A  20/11/2016     10          4.0
    1    B  20/11/2016     12          4.0
    2    A  25/12/2016     16          5.5
    3    B  25/12/2016     18          5.5
    

    Similar solution with droping instead selecting:

    df1 = df.assign(new=df['replacement'].str.len()).explode('replacement')
    df1['new'] = df1['value'].div(df1['new'])
    
    df1 = df1.drop(['replacement','new'],1).merge(df1.drop(['name','value'],1),
                                            left_on=['name','date'],
                                            right_on=['replacement','date'])
    df1['replacement'] = df1.pop('new')
    print (df1)
      name        date  value  replacement
    0    A  20/11/2016     10          4.0
    1    B  20/11/2016     12          4.0
    2    A  25/12/2016     16          5.5
    3    B  25/12/2016     18          5.5