Search code examples
pythonpandasdataframesubtraction

Python long format: subtract selection of rows


All,

I have the following long format dataframe:

df = pd.DataFrame({'date': ["2020-01-01","2020-01-01","2020-01-02","2020-01-02","2020-01-01","2020-01-01","2020-01-02","2020-01-02"], 'asset': ["x", "x","x", "x","y","y","y","y"], 'type': ["price", "spread","price","spread","price", "spread","price","spread"], 'value': ["1.5", "0.01","1.6", "0.01","1","0.08","1.2","0.09"]})

which looks like this:

         date asset    type value
0  2020-01-01     x   price   1.5
1  2020-01-01     x  spread  0.01
2  2020-01-02     x   price   1.6
3  2020-01-02     x  spread  0.01
4  2020-01-01     y   price     1
5  2020-01-01     y  spread  0.08
6  2020-01-02     y   price   1.2
7  2020-01-02     y  spread  0.09

I want to subtract the price of y from the price of x and maintain the same datastructure, result should should look like this:

         date    asset       type value
0  2020-01-01        x      price   1.5
1  2020-01-01        x     spread  0.01
2  2020-01-02        x      price   1.6
3  2020-01-02        x     spread  0.01
4  2020-01-01        y      price     1
5  2020-01-01        y     spread  0.08
6  2020-01-02        y      price   1.2
7  2020-01-02        y     spread  0.09
8  2020-01-01  x_min_y  pricediff   0.5
9  2020-01-02  x_min_y  pricediff   0.4

I would like to use the assign() function of pandas to create this but I am not sure how to do this.

Thanks in advance!


Solution

  • Use:

    m = df['type'].eq('price') & df['asset'].isin(['x', 'y'])
    d = df[m].pivot('date', 'asset', 'value').astype(float)
    
    d = pd.concat(
        [df, d['x'].sub(d['y']).reset_index(name='value').assign(
            asset='x_min_y', type='pricediff')],
        ignore_index=True)
    

    Details:

    Create a boolean mask m to filter the rows where type is price and asset is in x, y and use DataFrame.pivot to reshape the dataframe:

    print(d) # pivoted dataframe
    asset         x    y
    date                
    2020-01-01  1.5  1.0
    2020-01-02  1.6  1.2
    

    Use Series.sub to subtract column x from y in the pivoted dataframe and assign the columns asset and type, then use pd.concat to concat this pivoted dataframe with the original dataframe df.

    print(d)
             date    asset       type value
    0  2020-01-01        x      price   1.5
    1  2020-01-01        x     spread  0.01
    2  2020-01-02        x      price   1.6
    3  2020-01-02        x     spread  0.01
    4  2020-01-01        y      price     1
    5  2020-01-01        y     spread  0.08
    6  2020-01-02        y      price   1.2
    7  2020-01-02        y     spread  0.09
    8  2020-01-01  x_min_y  pricediff   0.5
    9  2020-01-02  x_min_y  pricediff   0.4