I have multiple trading pairs for the same asset in the familiar OHLCV format. Data is indexed by (pair, timestamp) tuples and contains OHLCV columns.
How can I calculate volume-weighted average prices and total volumes across all trading pairs in a single DataFrame?
To get the maximum performance you can do
Example:
def calculate_volume_weighted_ohlcv(df: pd.DataFrame) -> pd.DataFrame:
"""Calculate volume weighted average prices (vwap) on OHLCV.
- Assume input has multiple entries for each timestamp (index) separated by column "pair_id"
- open, high, low, close columns are weighted by volume column
- volume is the sum of all volume by timestamp
- liquidity, if in the columns, is the sum of all liquidity by timestamp
.. code-block:: text
pair_id open high low close volume liquidity
timestamp
2020-01-01 1 100 100 100 100 500 10
2020-02-02 1 100 100 100 100 500 10
2020-01-01 2 110 110 110 110 250 20
2020-02-02 2 110 110 110 110 250 20
2020-02-02 3 200 200 200 200 1000 30
:param df:
Must have columns pair_id, timestamp, open, high, low, close, volume and optionally liquidity
:return:
Aggregated open, high, low, close, volume, liquidity
"""
assert isinstance(df.index, pd.DatetimeIndex)
timestamp_agg = df.groupby(level='timestamp')
# Calculate 0..1 weight for each (pair_id, timestamp) combo
df["total_volume_in_timestamp"] = timestamp_agg.agg("volume").sum()
df["weight"] = df["volume"] / df["total_volume_in_timestamp"]
result_df = pd.DataFrame()
df["open_weighted"] = (df["open"] * df["weight"])
df["high_weighted"] = (df["high"] * df["weight"])
df["low_weighted"] = (df["low"] * df["weight"])
df["close_weighted"] = (df["close"] * df["weight"])
grouped_2 = df.groupby("timestamp")
result_df["open"] = grouped_2["open_weighted"].sum()
result_df["high"] = grouped_2["high_weighted"].sum()
result_df["low"] = grouped_2["low_weighted"].sum()
result_df["close"] = grouped_2["close_weighted"].sum()
result_df["volume"] = grouped_2["volume"].sum()
if "liquidity" in df.columns:
result_df["liquidity"] = grouped_2["liquidity"].sum()
return result_df