python dataframe multiplication python-polars

Multiply polars columns of number type with object type (which supports __mul__)


I have the following code.

import polars as pl

class Summary:
    def __init__(self, value: float, origin: str):
        self.value  = value
        self.origin = origin

    def __repr__(self) -> str:
        return f'Summary({self.value},{self.origin})'

    def __mul__(self, x: float | int) -> 'Summary':
        return Summary(self.value * x, self.origin)

    def __rmul__(self, x: float | int) -> 'Summary':
        return self * x

mapping = {
    'CASH':  Summary( 1, 'E'),
    'ITEM':  Summary(-9, 'A'),
    'CHECK': Summary(46, 'A'),
}

df = pl.DataFrame({'quantity': [7, 4, 10], 'type': mapping.keys(), 'summary': mapping.values()})

The dataframe df looks as follows.

shape: (3, 3)
┌──────────┬───────┬───────────────┐
│ quantity ┆ type  ┆ summary       │
│ ---      ┆ ---   ┆ ---           │
│ i64      ┆ str   ┆ object        │
╞══════════╪═══════╪═══════════════╡
│ 7        ┆ CASH  ┆ Summary(1,E)  │
│ 4        ┆ ITEM  ┆ Summary(-9,A) │
│ 10       ┆ CHECK ┆ Summary(46,A) │
└──────────┴───────┴───────────────┘

In particular, the summary column contains Summary objects, which support multiplication. I'd like to multiply this column by the quantity column.

However, the naive approach raises an error.

df.with_columns(pl.col('quantity').mul(pl.col('summary')).alias('qty_summary'))
SchemaError: failed to determine supertype of i64 and object

Is there a way to multiply these columns?


Solution

  • Remember that Polars is designed so that computations run in Rust rather than in Python, which is typically much faster. If you run Python operations on each element instead, you lose much of the benefit of using Polars in the first place.

    Fortunately, Polars has a feature that is relevant here: it handles dataclasses "natively", converting them to struct columns.

    import polars as pl
    from dataclasses import dataclass
    
    
    @dataclass
    class Summary:
        value: float
        origin: str
    
        def __mul__(self, x: float | int) -> "Summary":
            return Summary(self.value * x, self.origin)
    
        def __rmul__(self, x: float | int) -> "Summary":
            return self * x
    
    
    mapping = {
        "CASH": Summary(1, "E"),
        "ITEM": Summary(-9, "A"),
        "CHECK": Summary(46, "A"),
    }
    
    df = pl.DataFrame(
        {
            "quantity": [7, 4, 10],
            "type": mapping.keys(),
            "summary": mapping.values(),
        }
    )
    
    df
    

    Because Summary is a dataclass, (1) you don't need to write __init__ and __repr__ yourself, since they are generated for you, and (2) you don't need to do anything special for Polars to convert the objects into structs.

    shape: (3, 3)
    ┌──────────┬───────┬────────────┐
    │ quantity ┆ type  ┆ summary    │
    │ ---      ┆ ---   ┆ ---        │
    │ i64      ┆ str   ┆ struct[2]  │
    ╞══════════╪═══════╪════════════╡
    │ 7        ┆ CASH  ┆ {1.0,"E"}  │
    │ 4        ┆ ITEM  ┆ {-9.0,"A"} │
    │ 10       ┆ CHECK ┆ {46.0,"A"} │
    └──────────┴───────┴────────────┘
    

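    Now you can use regular Polars struct operations: multiply the value field by quantity, then rebuild the struct together with the unchanged origin field.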

    df.with_columns(
        qty_summary=pl.struct(
            pl.col("summary").struct.field("value") * pl.col("quantity"),
            pl.col("summary").struct.field("origin"),
        )
    )