Search code examples
pythonjoinpython-polars

Polars asof join on next available date


I have a frame (events) which I want to join into another frame (fr), joining on Date and Symbol. There aren't necessarily any date overlaps. The date in events would match with the first occurrence only on the same or later date in fr, so if the event date is 2010-12-01, it would join on the same date or if not present then the next available date (2010-12-02).

I've tried to do this using search_sorted and join_asof but I'd like to group by the Symbol column and also this isn't a proper join. This somewhat works for a single Symbol only.

fr = pl.DataFrame(
    {
        'Symbol': ['A']*5,
        'Date': ['2010-08-29', '2010-09-01', '2010-09-05',
                 '2010-11-30', '2010-12-02'],
    }
).with_columns(pl.col('Date').str.to_date('%Y-%m-%d')).with_row_index().set_sorted("Date")

events = pl.DataFrame(
    {
         'Symbol': ['A']*3,
         'Earnings_Date': ['2010-06-01', '2010-09-01', '2010-12-01'],
         'Event': [1, 4, 7],
     }
).with_columns(pl.col('Earnings_Date').str.to_date('%Y-%m-%d')).set_sorted("Earnings_Date")

idx = fr["Date"].search_sorted(events["Earnings_Date"], "left")

fr = fr.with_columns(
    pl.when(
        pl.col("index").is_in(idx)
    )
    .then(True)
    .otherwise(False)
    .alias("Earnings")
)

fr = fr.join_asof(events, by="Symbol", left_on="Date", right_on="Earnings_Date")
fr = fr.with_columns(
    pl.when(
       pl.col("Earnings") == True
    )
    .then(pl.col("Event"))
    .otherwise(False)
    .alias("Event")
)

Solution

  • It sounds like you are on the right track using pl.DataFrame.join_asof. To group by the symbol the by parameter can be used.

    (
        fr
        .join_asof(
            events,
            left_on="Date", right_on="Earnings_Date",
            by="Symbol",
        )
    )
    
    shape: (5, 5)
    ┌───────┬────────┬────────────┬───────────────┬───────┐
    │ index ┆ Symbol ┆ Date       ┆ Earnings_Date ┆ Event │
    │ ---   ┆ ---    ┆ ---        ┆ ---           ┆ ---   │
    │ u32   ┆ str    ┆ date       ┆ date          ┆ i64   │
    ╞═══════╪════════╪════════════╪═══════════════╪═══════╡
    │ 0     ┆ A      ┆ 2010-08-29 ┆ 2010-06-01    ┆ 1     │
    │ 1     ┆ A      ┆ 2010-09-01 ┆ 2010-09-01    ┆ 4     │
    │ 2     ┆ A      ┆ 2010-09-05 ┆ 2010-09-01    ┆ 4     │
    │ 3     ┆ A      ┆ 2010-11-30 ┆ 2010-09-01    ┆ 4     │
    │ 4     ┆ A      ┆ 2010-12-02 ┆ 2010-12-01    ┆ 7     │
    └───────┴────────┴────────────┴───────────────┴───────┘
    

    Now, I understand that you'd like each event to be matched at most once. I don't believe this is possible with join_asof alone. However, we can set all event rows that equal the previous row to Null. For this, an pl.when().then() construct can be used.

    (
        fr
        .join_asof(
            events,
            left_on="Date", right_on="Earnings_Date",
            by="Symbol",
        )
        .with_columns(
            pl.when(
                pl.col("Earnings_Date", "Event").is_first_distinct()
            ).then(
                pl.col("Earnings_Date", "Event")
            ).over("Symbol")
        )
    )
    
    shape: (5, 5)
    ┌───────┬────────┬────────────┬───────────────┬───────┐
    │ index ┆ Symbol ┆ Date       ┆ Earnings_Date ┆ Event │
    │ ---   ┆ ---    ┆ ---        ┆ ---           ┆ ---   │
    │ u32   ┆ str    ┆ date       ┆ date          ┆ i64   │
    ╞═══════╪════════╪════════════╪═══════════════╪═══════╡
    │ 0     ┆ A      ┆ 2010-08-29 ┆ 2010-06-01    ┆ 1     │
    │ 1     ┆ A      ┆ 2010-09-01 ┆ 2010-09-01    ┆ 4     │
    │ 2     ┆ A      ┆ 2010-09-05 ┆ null          ┆ null  │
    │ 3     ┆ A      ┆ 2010-11-30 ┆ null          ┆ null  │
    │ 4     ┆ A      ┆ 2010-12-02 ┆ 2010-12-01    ┆ 7     │
    └───────┴────────┴────────────┴───────────────┴───────┘