Combining OHLCV price series of multiple trading pairs by volume-weighting

I have multiple trading pairs for the same asset in the familiar OHLCV format. Data is indexed by (pair, timestamp) tuples and contains OHLCV columns.

How can I calculate volume-weighted average prices and total volumes across all trading pairs in a single DataFrame?

Solution

To get the maximum performance you can do

Calculate first the total volume for each timestamp
Calculate weight for each pair entry in a timestamp based on its volume
Based on weights, calculate the resulting open/high/low/close

Example:

def calculate_volume_weighted_ohlcv(df: pd.DataFrame) -> pd.DataFrame:
    """Calculate volume weighted average prices (vwap) on OHLCV.

    - Assume input has multiple entries for each timestamp (index) separated by column "pair_id"

    - open, high, low, close columns are weighted by volume column

    - volume is the sum of all volume by timestamp

    - liquidity, if in the columns, is the sum of all liquidity by timestamp

    .. code-block:: text

                    pair_id  open  high  low  close  volume  liquidity
        timestamp
        2020-01-01        1   100   100  100    100     500         10
        2020-02-02        1   100   100  100    100     500         10
        2020-01-01        2   110   110  110    110     250         20
        2020-02-02        2   110   110  110    110     250         20
        2020-02-02        3   200   200  200    200    1000         30


    :param df:
        Must have columns pair_id, timestamp, open, high, low, close, volume and optionally liquidity

    :return:
        Aggregated open, high, low, close, volume, liquidity
    """

    assert isinstance(df.index, pd.DatetimeIndex)

    timestamp_agg = df.groupby(level='timestamp')

    # Calculate 0..1 weight for each (pair_id, timestamp) combo
    df["total_volume_in_timestamp"] = timestamp_agg.agg("volume").sum()
    df["weight"] = df["volume"] / df["total_volume_in_timestamp"]

    result_df = pd.DataFrame()
    df["open_weighted"] = (df["open"] * df["weight"])
    df["high_weighted"] = (df["high"] * df["weight"])
    df["low_weighted"] = (df["low"] * df["weight"])
    df["close_weighted"] = (df["close"] * df["weight"])

    grouped_2 = df.groupby("timestamp")
    result_df["open"] = grouped_2["open_weighted"].sum()
    result_df["high"] = grouped_2["high_weighted"].sum()
    result_df["low"] = grouped_2["low_weighted"].sum()
    result_df["close"] = grouped_2["close_weighted"].sum()
    result_df["volume"] = grouped_2["volume"].sum()
    if "liquidity" in df.columns:
        result_df["liquidity"] = grouped_2["liquidity"].sum()
    return result_df