Say, I have a polars dataframe as below:
df = pl.DataFrame(
{
"date": ["d1", "d1", "d1", "d1", "d1", "d1"],
"ticker": ["t1", "t1", "t1", "t1", "t1", "t1"],
"positionType": ["Long", "Long", "Short", "Short", "Long", "Short"],
"predictionType": ["p1", "s1", "p1", "s1", "p1", "p1"],
"holdingBars": [1, 1, 1, 1, 2, 2],
"res": [1, 2, 3, 4, 5, 6],
}
)
   | date | ticker | positionType | predictionType | holdingBars | res |
---|------|--------|--------------|----------------|-------------|-----|
 0 | d1   | t1     | Long         | p1             | 1           | 1   |
 1 | d1   | t1     | Long         | s1             | 1           | 2   |
 2 | d1   | t1     | Short        | p1             | 1           | 3   |
 3 | d1   | t1     | Short        | s1             | 1           | 4   |
 4 | d1   | t1     | Long         | p1             | 2           | 5   |
 5 | d1   | t1     | Short        | p1             | 2           | 6   |
I have tried pivoting it as below:
df.pivot(
    values="res",
    index=["date", "ticker", "predictionType"],
    columns=["positionType", "holdingBars"],
    aggregate_function="first",
)
Output:

date | ticker | predictionType | Long | Short | 1 | 2 |
---|---|---|---|---|---|---|
d1 | t1 | p1 | 1 | 3 | 1 | 5 |
d1 | t1 | s1 | 2 | 4 | 2 | null |

Whereas the output I am after is:

date | ticker | predictionType | Long_1 | Short_1 | Long_2 | Short_2 |
---|---|---|---|---|---|---|
d1 | t1 | p1 | 1 | 3 | 5 | 6 |
d1 | t1 | s1 | 2 | 4 | null | null |

and, ultimately, with the Long/Short ratio columns added:

date | ticker | predictionType | Long_1 | Short_1 | Long_2 | Short_2 | Long_1 / Short_1 | Long_2 / Short_2 |
---|---|---|---|---|---|---|---|---|
d1 | t1 | p1 | 1 | 3 | 5 | 6 | 0.33 | 0.83 |
d1 | t1 | s1 | 2 | 4 | null | null | 0.5 | NaN |
Please note that this is a subset of the data: there can be multiple tickers across multiple dates, and res can span multiple holding periods.
Any idea how I can achieve these, please?
ta!
You can create the column name before pivoting:
(
    df.with_columns(
        columns=pl.col('positionType') + '_' + pl.col('holdingBars').cast(str)
    )
    .pivot(
        values='res',
        index=['date', 'ticker', 'predictionType'],
        on='columns',
        aggregate_function='first',
    )
)
shape: (2, 7)
┌──────┬────────┬────────────────┬────────┬─────────┬────────┬─────────┐
│ date ┆ ticker ┆ predictionType ┆ Long_1 ┆ Short_1 ┆ Long_2 ┆ Short_2 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞══════╪════════╪════════════════╪════════╪═════════╪════════╪═════════╡
│ d1 ┆ t1 ┆ p1 ┆ 1 ┆ 3 ┆ 5 ┆ 6 │
│ d1 ┆ t1 ┆ s1 ┆ 2 ┆ 4 ┆ null ┆ null │
└──────┴────────┴────────────────┴────────┴─────────┴────────┴─────────┘
As your Long/Short items come in ordered pairs, you can group them and use first/last to isolate each side:

(
    df.with_columns(
        (pl.col('positionType') == 'Long').cum_sum().over('ticker', 'predictionType')
          .alias('predId')
    )
    .group_by('date', 'ticker', 'predictionType', 'predId', maintain_order=True)
    .agg(
        pl.first('res').alias('Long'),
        pl.last('res').alias('Short'),
        (pl.first('res') / pl.last('res')).alias('Long/Short'),
    )
)
shape: (3, 7)
┌──────┬────────┬────────────────┬────────┬──────┬───────┬────────────┐
│ date ┆ ticker ┆ predictionType ┆ predId ┆ Long ┆ Short ┆ Long/Short │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ u32 ┆ i64 ┆ i64 ┆ f64 │
╞══════╪════════╪════════════════╪════════╪══════╪═══════╪════════════╡
│ d1 ┆ t1 ┆ p1 ┆ 1 ┆ 1 ┆ 3 ┆ 0.333333 │
│ d1 ┆ t1 ┆ s1 ┆ 1 ┆ 2 ┆ 4 ┆ 0.5 │
│ d1 ┆ t1 ┆ p1 ┆ 2 ┆ 5 ┆ 6 ┆ 0.833333 │
└──────┴────────┴────────────────┴────────┴──────┴───────┴────────────┘
You can then .unpivot() with the ['date', 'ticker', 'predictionType', 'predId'] columns as the index and pivot:

(
    df.with_columns(
        (pl.col('positionType') == 'Long').cum_sum().over('ticker', 'predictionType')
          .alias('predId')
    )
    .group_by('date', 'ticker', 'predictionType', 'predId', maintain_order=True)
    .agg(
        pl.first('res').alias('Long'),
        pl.last('res').alias('Short'),
        (pl.first('res') / pl.last('res')).alias('Long/Short'),
    )
    .unpivot(
        index=['date', 'ticker', 'predictionType', 'predId'],
        variable_name='column',
    )
    .with_columns(
        pl.col('column') + '_' + pl.col('predId').cast(str)
    )
    .pivot(
        values='value',
        index=['date', 'ticker', 'predictionType'],
        on='column',
        aggregate_function=None,
    )
)
shape: (2, 9)
┌──────┬────────┬────────────────┬────────┬────────┬─────────┬─────────┬──────────────┬──────────────┐
│ date ┆ ticker ┆ predictionType ┆ Long_1 ┆ Long_2 ┆ Short_1 ┆ Short_2 ┆ Long/Short_1 ┆ Long/Short_2 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞══════╪════════╪════════════════╪════════╪════════╪═════════╪═════════╪══════════════╪══════════════╡
│ d1 ┆ t1 ┆ p1 ┆ 1.0 ┆ 5.0 ┆ 3.0 ┆ 6.0 ┆ 0.333333 ┆ 0.833333 │
│ d1 ┆ t1 ┆ s1 ┆ 2.0 ┆ null ┆ 4.0 ┆ null ┆ 0.5 ┆ null │
└──────┴────────┴────────────────┴────────┴────────┴─────────┴─────────┴──────────────┴──────────────┘
Because .unpivot() stacks all of the value columns into a single column, the integers are upcast to floats, which may or may not be an issue.
It's also possible to "pivot" in the following manner:

pivots = 'Long', 'Short', 'Long/Short'

(
    df.with_columns(
        (pl.col('positionType') == 'Long').cum_sum().over('ticker', 'predictionType')
          .alias('predId')
    )
    .group_by('date', 'ticker', 'predictionType', 'predId', maintain_order=True)
    .agg(
        pl.first('res').alias('Long'),
        pl.last('res').alias('Short'),
        (pl.first('res') / pl.last('res')).alias('Long/Short'),
    )
    .group_by('date', 'ticker', 'predictionType')
    .agg(pivots)
    .with_columns(
        pl.col(col).list.to_struct(
            n_field_strategy='max_width',
            fields=lambda idx, col=col: f'{col}_{idx + 1}',
        )
        for col in pivots
    )
    .unnest(pivots)
)
shape: (2, 9)
┌──────┬────────┬────────────────┬────────┬────────┬─────────┬─────────┬──────────────┬──────────────┐
│ date ┆ ticker ┆ predictionType ┆ Long_1 ┆ Long_2 ┆ Short_1 ┆ Short_2 ┆ Long/Short_1 ┆ Long/Short_2 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ f64 ┆ f64 │
╞══════╪════════╪════════════════╪════════╪════════╪═════════╪═════════╪══════════════╪══════════════╡
│ d1 ┆ t1 ┆ s1 ┆ 2 ┆ null ┆ 4 ┆ null ┆ 0.5 ┆ null │
│ d1 ┆ t1 ┆ p1 ┆ 1 ┆ 5 ┆ 3 ┆ 6 ┆ 0.333333 ┆ 0.833333 │
└──────┴────────┴────────────────┴────────┴────────┴─────────┴─────────┴──────────────┴──────────────┘