Search code examples
pythonwindow-functionspython-polars

Filter rows inside window function in Python Polars?


I need to compute the Herfindahl–Hirschman index ("HHI", the sum of squared market shares) but leaving out the firm represented in the row. Here's an example:

df = (pl.DataFrame({
    'year':(2023, 2023, 2023, 2024, 2024, 2024),
    'firm':('A', 'B', 'C', 'A', 'B', 'C'),
    'volume':(20, 50, 3, 25, 13, 5)
})
.with_columns(
    sum = pl.col('volume').sum().over('year'),
    leaveout_sum = (pl.col('volume').sum().over('year'))-(pl.col('volume'))
)
.with_columns(
    share = (pl.col('volume')/pl.col('sum'))*100
)
.with_columns(
    hhi = (pl.col('share')**2).sum().over('year').round()
))

Which gives:

┌──────┬──────┬────────┬─────┬──────────────┬───────────┬────────┐
│ year ┆ firm ┆ volume ┆ sum ┆ leaveout_sum ┆ share     ┆ hhi    │
│ ---  ┆ ---  ┆ ---    ┆ --- ┆ ---          ┆ ---       ┆ ---    │
│ i64  ┆ str  ┆ i64    ┆ i64 ┆ i64          ┆ f64       ┆ f64    │
╞══════╪══════╪════════╪═════╪══════════════╪═══════════╪════════╡
│ 2023 ┆ A    ┆ 20     ┆ 73  ┆ 53           ┆ 27.39726  ┆ 5459.0 │
│ 2023 ┆ B    ┆ 50     ┆ 73  ┆ 23           ┆ 68.493151 ┆ 5459.0 │
│ 2023 ┆ C    ┆ 3      ┆ 73  ┆ 70           ┆ 4.109589  ┆ 5459.0 │
│ 2024 ┆ A    ┆ 25     ┆ 43  ┆ 18           ┆ 58.139535 ┆ 4429.0 │
│ 2024 ┆ B    ┆ 13     ┆ 43  ┆ 30           ┆ 30.232558 ┆ 4429.0 │
│ 2024 ┆ C    ┆ 5      ┆ 43  ┆ 38           ┆ 11.627907 ┆ 4429.0 │
└──────┴──────┴────────┴─────┴──────────────┴───────────┴────────┘

The hhi column there is the normal HHI index, including all firms in the market, and I can compute the "leaveout" volume summation to get the sum of volumes of the other firms in that year. For example, the leaveout-HHI for firm A in 2023 would be the square of 3/53 plus the square of 50/53 (i.e., the squares of the market shares of firms B and C assuming firm A didn't exist).

How do I tell polars to do this? Is there a way to filter the window function, perhaps?

My real dataset includes almost 800 firms over 204 months for 500 separate markets, so doing this manually is out of the question.


Solution

  • You can take out the denominator from the sum of squares:

    .with_columns(
        leaveout_sum = (pl.col.volume.sum().over('year')) - pl.col.volume,
        leaveout_sum_of_sq = (pl.col.volume**2).sum().over('year') - pl.col.volume**2
    )
    .with_columns(
        leaveout_hhi = pl.col.leaveout_sum_of_sq / pl.col.leaveout_sum**2
    ))
    

    I left out the * 100 factor, the above correctly computes (3/53)^2 + (50/53)^2 for your example by doing (3^2 + 50^2) / (50 + 3)^2.