Search code examples
pythonexcelpandasdataframeseries

Python Dataframe calculations based on another dataframe


I've 2 data frames.

Datraframe1 Dataframe2

I'd like to populate DF1 columns with data from DF2 using below calculations. Prev_close column : for each (symbol, date) as key from df1, lookup (symbol,date) in df2 and get previous row "CLOSE" column value and populate in DF1 "prev_close" column.

Below rest of columns are in % format. D1L column : for each (symbol, date) as key from df1, lookup (symbol,date) in df2 and get "LOW" column value and populate in DF1 "D1L" column as a formula ("low" - "prev_close") / (prev_close)

D1H column : for each (symbol, date) as key from df1, lookup (symbol,date) in df2 and get "HIGH" column value and populate in DF1 "D1H" column as a formula ("high" - "prev_close" ) / (prev_close)

D2L column : for each (symbol, date) as key from df1, lookup (symbol,date) in df2 and get 2nd row after lookup row (current row + 1) "LOW" column value and populate in DF1 "D2L" column as a formula ("low" - "prev_close") / (prev_close)

D2H column : for each (symbol, date) as key from df1, lookup (symbol,date) in df2 and get 2nd row after lookup row (current row + 1) "HIGH" column value and populate in DF1 "D2H" column as a formula ("high" - "prev_close" ) / (prev_close)

D3L column : for each (symbol, date) as key from df1, lookup (symbol,date) in df2 and get 3rd row after lookup row (current row + 2) "LOW" column value and populate in DF1 "D3L" column as a formula ("low" - "prev_close") / (prev_close)

D3H column : for each (symbol, date) as key from df1, lookup (symbol,date) in df2 and get 3rd row after lookup row (current row + 2) "HIGH" column value and populate in DF1 "D3H" column as a formula ("high" - "prev_close" ) / (prev_close)

in the end, i'm expecting output as follows and appreciate any help or code you can provide. Expected output of Dataframe1


Solution

  • I've tried to break it up step-by-step so its easier to follow.

    df1 and df2 are what you provided.

    • The groupby function makes sure we don't mix up each symbol.
    • The shift function fetches the previous or next row.
    • The apply section is for converting the result into a blank if invalid, or a percentage to 3 decimal places if valid. {:.3%} You can change this number to change the decimal places required.
    • The merge function is the final step to only get the rows and calculations that you need.
    import pandas as pd
    import numpy as np
    
    df1 = pd.DataFrame({'Symbol': ['SPX', 'SPX', 'SPX', 'AAPL', 'AAPL', 'AAPL', 'AAPL', 'AAPL', 'AAPL'],
                       'open': [2665.69, 2689.6, 2692.165, 123.5402, 128.1923, 124.5984, 121.5435, 120.7749, 120.7249],
                       'high': [2712.81, 2722.22, 2712.31, 127.7131, 128.5017, 125.4968, 123.3904, 121.7332, 120.7948],
                       'low': [2656.505, 2667.57, 2668.65, 122.5818, 124.798, 121.6334, 118.4189, 117.3706, 116.0129],
                       'close': [2697.82, 2683.36, 2695.03, 127.5733, 124.9078, 121.853, 119.9263, 121.2141, 116.1627],
                       'volume': [1483784, 1231996, 1198337, 1.16E+08, 1.02E+08, 1.13E+08, 1.78E+08, 1.54E+08, 1.54E+08],
                       'date': [20220301, 20220302, 20220303, 20220301, 20220302, 20220303, 20220304, 20220305, 20220306]})
    
    df2 = pd.DataFrame({'Symbol': ['SPX', 'AAPL'], 'date': [20220302, 20220304]})
    
    dfcal = df1
    dfcal['Prev_Close'] = df1.groupby(['Symbol'])['close'].shift(1)
    dfcal['D1L'] = ((dfcal.groupby(['Symbol'])['low'].shift(0) - dfcal['Prev_Close'] ) / dfcal['Prev_Close']).apply(lambda x: '' if np.isnan(x) else "{:.3%}".format(x))
    dfcal['D1H'] = ((dfcal.groupby(['Symbol'])['high'].shift(0) - dfcal['Prev_Close'] ) / dfcal['Prev_Close']).apply(lambda x: '' if np.isnan(x) else "{:.3%}".format(x))
    dfcal['D2L'] = ((dfcal.groupby(['Symbol'])['low'].shift(-1) - dfcal['Prev_Close'] ) / dfcal['Prev_Close']).apply(lambda x: '' if np.isnan(x) else "{:.3%}".format(x))
    dfcal['D2H'] = ((dfcal.groupby(['Symbol'])['high'].shift(-1) - dfcal['Prev_Close'] ) / dfcal['Prev_Close']).apply(lambda x: '' if np.isnan(x) else "{:.3%}".format(x))
    dfcal['D3L'] = ((dfcal.groupby(['Symbol'])['low'].shift(-2) - dfcal['Prev_Close'] ) / dfcal['Prev_Close']).apply(lambda x: '' if np.isnan(x) else "{:.3%}".format(x))
    dfcal['D3H'] = ((dfcal.groupby(['Symbol'])['high'].shift(-2) - dfcal['Prev_Close'] ) / dfcal['Prev_Close']).apply(lambda x: '' if np.isnan(x) else "{:.3%}".format(x))
    dfcal['D4L'] = ((dfcal.groupby(['Symbol'])['low'].shift(-3) - dfcal['Prev_Close'] ) / dfcal['Prev_Close']).apply(lambda x: '' if np.isnan(x) else "{:.3%}".format(x))
    dfcal['D4H'] = ((dfcal.groupby(['Symbol'])['high'].shift(-3) - dfcal['Prev_Close'] ) / dfcal['Prev_Close']).apply(lambda x: '' if np.isnan(x) else "{:.3%}".format(x))
    
    
    dffinal = pd.merge(df2, dfcal[['Symbol', 'date', 'Prev_Close', 'D1L', 'D1H', 'D2L', 'D2H', 'D3L', 'D3H', 'D4L', 'D4H']], on = ['Symbol', 'date'], how = 'left')
    dffinal