Tags: dataframe, python-polars

Polars dataframe: create new columns while pivoting a dataframe based on values in a column


Say, I have a polars dataframe as below:

import polars as pl

df = pl.DataFrame(
    {
        "date": ["d1", "d1", "d1", "d1", "d1", "d1"],
        "ticker": ["t1", "t1", "t1", "t1", "t1", "t1"],
        "positionType": ["Long", "Long", "Short", "Short", "Long", "Short"],
        "predictionType": ["p1", "s1", "p1", "s1", "p1", "p1"],
        "holdingBars": [1, 1, 1, 1, 2, 2],
        "res": [1, 2, 3, 4, 5, 6],
    }
)
   date  ticker  positionType  predictionType  holdingBars  res
0  d1    t1      Long          p1              1            1
1  d1    t1      Long          s1              1            2
2  d1    t1      Short         p1              1            3
3  d1    t1      Short         s1              1            4
4  d1    t1      Long          p1              2            5
5  d1    t1      Short         p1              2            6

I have tried pivoting it as below:

df.pivot(
    values="res",
    index=["date", "ticker", "predictionType"],
    columns=["positionType", "holdingBars"],
    aggregate_function="first",
)

Output:

date  ticker  predictionType  Long  Short  1  2
d1    t1      p1              1     3      1  5
d1    t1      s1              2     4      2  null
  1. I would, however, like to see it as below: basically, a concatenation of the positionType and holdingBars columns, with the corresponding res value:
date  ticker  predictionType  Long_1  Short_1  Long_2  Short_2
d1    t1      p1              1       3        5       6
d1    t1      s1              2       4        null    null
  2. I would also like to add additional ratio columns, Long_1/Short_1 and Long_2/Short_2:
date  ticker  predictionType  Long_1  Short_1  Long_2  Short_2  Long_1/Short_1  Long_2/Short_2
d1    t1      p1              1       3        5       6        0.3             0.83
d1    t1      s1              2       4        null    null     0.5             NaN

Please note that this is a subset of the data; it can have multiple tickers across multiple dates, and res can span multiple holding periods.

Any idea how to achieve these, please?

ta!


Solution

  • You can create the column name before pivoting:

    (
        df.with_columns(
            columns=pl.col('positionType') + '_' + pl.col('holdingBars').cast(str)
        ).pivot(
            values='res',
            index=['date', 'ticker', 'predictionType'],
            columns='columns',
            aggregate_function='first',
        )
    )
    
    shape: (2, 7)
    ┌──────┬────────┬────────────────┬────────┬─────────┬────────┬─────────┐
    │ date ┆ ticker ┆ predictionType ┆ Long_1 ┆ Short_1 ┆ Long_2 ┆ Short_2 │
    │ ---  ┆ ---    ┆ ---            ┆ ---    ┆ ---     ┆ ---    ┆ ---     │
    │ str  ┆ str    ┆ str            ┆ i64    ┆ i64     ┆ i64    ┆ i64     │
    ╞══════╪════════╪════════════════╪════════╪═════════╪════════╪═════════╡
    │ d1   ┆ t1     ┆ p1             ┆ 1      ┆ 3       ┆ 5      ┆ 6       │
    │ d1   ┆ t1     ┆ s1             ┆ 2      ┆ 4       ┆ null   ┆ null    │
    └──────┴────────┴────────────────┴────────┴─────────┴────────┴─────────┘
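
    If this first pivot already gives the shape you want, one way (a sketch of my own, not part of the original answer) to get the requested ratio columns directly is to divide each Long_n column by its Short_n counterpart in a with_columns, deriving the holding-period values from the data:

    # assumes `pivoted` is bound to the result of the pivot shown above
    holding_bars = df['holdingBars'].unique().sort()  # e.g. [1, 2]

    pivoted = pivoted.with_columns(
        # one ratio column per holding period, e.g. Long_1/Short_1
        (pl.col(f'Long_{n}') / pl.col(f'Short_{n}')).alias(f'Long_{n}/Short_{n}')
        for n in holding_bars
    )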
    

    Long_1/Short_1

    As your Long/Short items come in ordered pairs, you can groupby and use first/last to isolate them:

    (
       df.with_columns(
          (pl.col('positionType') == 'Long').cumsum().over('ticker', 'predictionType')
             .alias('predId')
       )
       .groupby('date', 'ticker', 'predictionType', 'predId', maintain_order=True)
       .agg(
          pl.first('res').alias('Long'),
          pl.last('res').alias('Short'),
          (pl.first('res') / pl.last('res')).alias('Long/Short'),
       )
    )
    
    shape: (3, 7)
    ┌──────┬────────┬────────────────┬────────┬──────┬───────┬────────────┐
    │ date ┆ ticker ┆ predictionType ┆ predId ┆ Long ┆ Short ┆ Long/Short │
    │ ---  ┆ ---    ┆ ---            ┆ ---    ┆ ---  ┆ ---   ┆ ---        │
    │ str  ┆ str    ┆ str            ┆ u32    ┆ i64  ┆ i64   ┆ f64        │
    ╞══════╪════════╪════════════════╪════════╪══════╪═══════╪════════════╡
    │ d1   ┆ t1     ┆ p1             ┆ 1      ┆ 1    ┆ 3     ┆ 0.333333   │
    │ d1   ┆ t1     ┆ s1             ┆ 1      ┆ 2    ┆ 4     ┆ 0.5        │
    │ d1   ┆ t1     ┆ p1             ┆ 2      ┆ 5    ┆ 6     ┆ 0.833333   │
    └──────┴────────┴────────────────┴────────┴──────┴───────┴────────────┘
    

    You can then .melt(), keeping ['date', 'ticker', 'predictionType', 'predId'] as the id_vars, and pivot:

    (
       df.with_columns(
          (pl.col('positionType') == 'Long').cumsum().over('ticker', 'predictionType')
             .alias('predId')
       )
       .groupby('date', 'ticker', 'predictionType', 'predId', maintain_order=True)
       .agg(
          pl.first('res').alias('Long'),
          pl.last('res').alias('Short'),
          (pl.first('res') / pl.last('res')).alias('Long/Short')
       )
       .melt(
          id_vars = ['date', 'ticker', 'predictionType', 'predId'],
          variable_name = 'column'
       )
       .with_columns(
          pl.col('column') + '_' + pl.col('predId').cast(str)
       ).pivot(
          values = 'value',
          index = ['date', 'ticker', 'predictionType'],
          columns = 'column',
          aggregate_function = None
       )
    )
    
    shape: (2, 9)
    ┌──────┬────────┬────────────────┬────────┬────────┬─────────┬─────────┬──────────────┬──────────────┐
    │ date ┆ ticker ┆ predictionType ┆ Long_1 ┆ Long_2 ┆ Short_1 ┆ Short_2 ┆ Long/Short_1 ┆ Long/Short_2 │
    │ ---  ┆ ---    ┆ ---            ┆ ---    ┆ ---    ┆ ---     ┆ ---     ┆ ---          ┆ ---          │
    │ str  ┆ str    ┆ str            ┆ f64    ┆ f64    ┆ f64     ┆ f64     ┆ f64          ┆ f64          │
    ╞══════╪════════╪════════════════╪════════╪════════╪═════════╪═════════╪══════════════╪══════════════╡
    │ d1   ┆ t1     ┆ p1             ┆ 1.0    ┆ 5.0    ┆ 3.0     ┆ 6.0     ┆ 0.333333     ┆ 0.833333     │
    │ d1   ┆ t1     ┆ s1             ┆ 2.0    ┆ null   ┆ 4.0     ┆ null    ┆ 0.5          ┆ null         │
    └──────┴────────┴────────────────┴────────┴────────┴─────────┴─────────┴──────────────┴──────────────┘
    

    The use of .melt causes upcasting to floats, which may or may not be an issue.
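
    If the integer dtypes matter, one possible workaround (an assumption on my part, not in the original answer) is to cast the Long_n/Short_n value columns back to integers afterwards, leaving the ratio columns as floats:

    # assumes `out` holds the melt-and-pivot result shown above
    out = out.with_columns(
        # regex pl.col selector matches Long_1, Short_2, ... but not Long/Short_1
        pl.col('^(Long|Short)_[0-9]+$').cast(pl.Int64)
    )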

    It's also possible to "pivot" in the following manner:

    pivots = 'Long', 'Short', 'Long/Short'
    
    (
       df.with_columns(
          (pl.col('positionType') == 'Long').cumsum().over('ticker', 'predictionType')
             .alias('predId')
       )
       .groupby('date', 'ticker', 'predictionType', 'predId', maintain_order=True)
       .agg(
          pl.first('res').alias('Long'),
          pl.last('res').alias('Short'),
          (pl.first('res') / pl.last('res')).alias('Long/Short')
       )
       .groupby('date', 'ticker', 'predictionType')
       .agg(pivots)
       .with_columns(
          pl.col(col).list.to_struct(
             n_field_strategy = 'max_width',
             fields = lambda idx, col=col: f'{col}_{idx + 1}'
          )
          for col in pivots
       )
       .unnest(pivots)
    )
    
    shape: (2, 9)
    ┌──────┬────────┬────────────────┬────────┬────────┬─────────┬─────────┬──────────────┬──────────────┐
    │ date ┆ ticker ┆ predictionType ┆ Long_1 ┆ Long_2 ┆ Short_1 ┆ Short_2 ┆ Long/Short_1 ┆ Long/Short_2 │
    │ ---  ┆ ---    ┆ ---            ┆ ---    ┆ ---    ┆ ---     ┆ ---     ┆ ---          ┆ ---          │
    │ str  ┆ str    ┆ str            ┆ i64    ┆ i64    ┆ i64     ┆ i64     ┆ f64          ┆ f64          │
    ╞══════╪════════╪════════════════╪════════╪════════╪═════════╪═════════╪══════════════╪══════════════╡
    │ d1   ┆ t1     ┆ s1             ┆ 2      ┆ null   ┆ 4       ┆ null    ┆ 0.5          ┆ null         │
    │ d1   ┆ t1     ┆ p1             ┆ 1      ┆ 5      ┆ 3       ┆ 6       ┆ 0.333333     ┆ 0.833333     │
    └──────┴────────┴────────────────┴────────┴────────┴─────────┴─────────┴──────────────┴──────────────┘
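
    One caveat: the final .groupby does not pass maintain_order=True, which is why s1 appears before p1 above and why the row order is not guaranteed. Adding maintain_order=True to that call, or sorting afterwards, gives a stable order; a minimal sketch, assuming the frame above is bound to result:

    # assumes `result` is the unnested frame from the snippet above
    result = result.sort('date', 'ticker', 'predictionType')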