Search code examples
python-polars

Combine Polars dataframe with shifted-version into a single df using struct


I have a dataframe with stock price history of companies by date.

print(prices)

shape: (5, 4)
┌──────┬─────┬───────┬──────────┐
│ date ┆ ibm ┆ apple ┆ coinbase │
│ ---  ┆ --- ┆ ---   ┆ ---      │
│ i64  ┆ i64 ┆ i64   ┆ i64      │
╞══════╪═════╪═══════╪══════════╡
│ 1    ┆ 50  ┆ 100   ┆ 40       │
│ 2    ┆ 48  ┆ 200   ┆ 60       │
│ 3    ┆ 49  ┆ 400   ┆ 10       │
│ 4    ┆ 27  ┆ 800   ┆ 300      │
│ 5    ┆ 11  ┆ 1600  ┆ 7000     │
└──────┴─────┴───────┴──────────┘

And another with prices shifted by 2 date units.

prices_2 = prices.shift(-2)

shape: (5, 4)
┌──────┬──────┬───────┬──────────┐
│ date ┆ ibm  ┆ apple ┆ coinbase │
│ ---  ┆ ---  ┆ ---   ┆ ---      │
│ i64  ┆ i64  ┆ i64   ┆ i64      │
╞══════╪══════╪═══════╪══════════╡
│ 3    ┆ 49   ┆ 400   ┆ 10       │
│ 4    ┆ 27   ┆ 800   ┆ 300      │
│ 5    ┆ 11   ┆ 1600  ┆ 7000     │
│ null ┆ null ┆ null  ┆ null     │
│ null ┆ null ┆ null  ┆ null     │
└──────┴──────┴───────┴──────────┘

I want to bundle the 2 ibm series into structs in a single ibm column, with a field for the price @ date, and the shift(-2) price. Similar for apple etc.

How can I do this? Would it be efficient?

Don't know where to start, as it seems the resulting ibm column would not consist of 2 series, but a single series of pairs of (original, shifted) prices


Solution

  • pl.struct takes any number of expressions, allowing for putting shift expressions directly into it:

    df.with_columns(
        pl.struct(col, pl.col(col).shift(-2).suffix('2'))
        for col in df.columns
        if col != 'date'
    )
    
    shape: (5, 4)
    ┌──────┬───────────┬─────────────┬─────────────┐
    │ date ┆ ibm       ┆ apple       ┆ coinbase    │
    │ ---  ┆ ---       ┆ ---         ┆ ---         │
    │ i64  ┆ struct[2] ┆ struct[2]   ┆ struct[2]   │
    ╞══════╪═══════════╪═════════════╪═════════════╡
    │ 1    ┆ {50,49}   ┆ {100,400}   ┆ {40,10}     │
    │ 2    ┆ {48,27}   ┆ {200,800}   ┆ {60,300}    │
    │ 3    ┆ {49,11}   ┆ {400,1600}  ┆ {10,7000}   │
    │ 4    ┆ {27,null} ┆ {800,null}  ┆ {300,null}  │
    │ 5    ┆ {11,null} ┆ {1600,null} ┆ {7000,null} │
    └──────┴───────────┴─────────────┴─────────────┘
    

    Here the underlying fields would be ibm and ibm2 for the ibm column (see suffix), etc., you can use alias or the like to rename the columns/fields whatever you want.

    df.with_columns(...).schema
    
    {'date': Int64, 'ibm': Struct([Field('ibm', Int64), Field('ibm2', Int64)]), 'apple': Struct([Field('apple', Int64), Field('apple2', Int64)]), 'coinbase': Struct([Field('coinbase', Int64), Field('coinbase2', Int64)])}