Search code examples
python-polars

Polars mul uses lesser scale


This might be linked to Polars DataFrame - Decimal Precision doubles on mul with Integer.

I am facing issues in rounding at various stages in polars (v1.6.0) multiplication. Here is an example:

from decimal import Decimal
import polars as pl


df = pl.DataFrame({
    "a": [2, 5],
    "b": [Decimal('2.0'), Decimal('5.0')],
    "c": [Decimal('2.00'), Decimal('5.00')],
    "d": [Decimal('2.000'), Decimal('5.000')],
    "e": [Decimal('2.0000'), Decimal('5.0000')]
})
shape: (2, 5)
┌─────┬──────────────┬──────────────┬──────────────┬──────────────┐
│ a   ┆ b            ┆ c            ┆ d            ┆ e            │
│ --- ┆ ---          ┆ ---          ┆ ---          ┆ ---          │
│ i64 ┆ decimal[*,1] ┆ decimal[*,2] ┆ decimal[*,3] ┆ decimal[*,4] │
╞═════╪══════════════╪══════════════╪══════════════╪══════════════╡
│ 2   ┆ 2.0          ┆ 2.00         ┆ 2.000        ┆ 2.0000       │
│ 5   ┆ 5.0          ┆ 5.00         ┆ 5.000        ┆ 5.0000       │
└─────┴──────────────┴──────────────┴──────────────┴──────────────┘
# Demo 1
# 10/71 = 0.14084507042253522
dfn = df.with_columns(
    pl.col(col).mul(10/71)
    for col in df.columns
)
shape: (2, 5)
┌──────────┬──────────────┬──────────────┬──────────────┬──────────────┐
│ a        ┆ b            ┆ c            ┆ d            ┆ e            │
│ ---      ┆ ---          ┆ ---          ┆ ---          ┆ ---          │
│ f64      ┆ decimal[*,2] ┆ decimal[*,4] ┆ decimal[*,6] ┆ decimal[*,8] │
╞══════════╪══════════════╪══════════════╪══════════════╪══════════════╡
│ 0.28169  ┆ 0.20         ┆ 0.2800       ┆ 0.280000     ┆ 0.28160000   │
│ 0.704225 ┆ 0.50         ┆ 0.7000       ┆ 0.700000     ┆ 0.70400000   │
└──────────┴──────────────┴──────────────┴──────────────┴──────────────┘
# Demo 2
dfn = df.with_columns(
    pl.col(col).mul(Decimal(10/71).quantize((Decimal('1.0000'))))
    for col in df.columns
)
shape: (2, 5)
┌──────────────┬──────────────┬──────────────┬──────────────┬──────────────┐
│ a            ┆ b            ┆ c            ┆ d            ┆ e            │
│ ---          ┆ ---          ┆ ---          ┆ ---          ┆ ---          │
│ decimal[*,8] ┆ decimal[*,5] ┆ decimal[*,6] ┆ decimal[*,7] ┆ decimal[*,8] │
╞══════════════╪══════════════╪══════════════╪══════════════╪══════════════╡
│ 0.28160000   ┆ 0.28160      ┆ 0.281600     ┆ 0.2816000    ┆ 0.28160000   │
│ 0.70400000   ┆ 0.70400      ┆ 0.704000     ┆ 0.7040000    ┆ 0.70400000   │
└──────────────┴──────────────┴──────────────┴──────────────┴──────────────┘

In Demo 1 above, when I just multiply with 10/71, column b [decimal[*,1] seems to be multiplied by 0.1 instead of 0.140845..., c decimal[*,2] seems to be multiplied by 0.14, d decimal[*,3] by 0.140 and e decimal[*,4] by 0.1408. In short, whatever is the original decimal scale of the column that is multiplied by 10/71, the same scale of 10/71 seems to be used.

As seen in Demo 2, I must quantize the scale before I use it for multiplication. Note that I cannot just use .mul(Decimal(10/71)) without quantize, as doing so gives an error BindingsError: "Decimal is too large to fit in Decimal128".

Is there a better way than forcing the scale of the multiplier?


Solution

  • It seems like it's doing the right thing to me so it's not clear what better result you're looking for.

    (Not withstanding the bug (as it seems to me) where it doubles the scale)

    Another demo
    Decimal("2.0")*(10/71)
    # TypeError: unsupported operand type(s) for *: 'decimal.Decimal' and 'float'
    

    so the python Decimal package won't even try to do Decimal * float so it's hard to tell what "better" is.

    As you note, what polars seems to be doing is converting the float to the same scale as the lhs Decimal. I can't think of a better default than that.

    First alternative

    Have polars do two integer operations instead of python creating the float

    df.with_columns(
        pl.all().mul(10).truediv(71)
    )
    shape: (2, 5)
    ┌──────────┬──────────────┬──────────────┬───────────────┬────────────────┐
    │ a        ┆ b            ┆ c            ┆ d             ┆ e              │
    │ ---      ┆ ---          ┆ ---          ┆ ---           ┆ ---            │
    │ f64      ┆ decimal[*,6] ┆ decimal[*,8] ┆ decimal[*,10] ┆ decimal[*,12]  │
    ╞══════════╪══════════════╪══════════════╪═══════════════╪════════════════╡
    │ 0.28169  ┆ 0.281690     ┆ 0.28169014   ┆ 0.2816901408  ┆ 0.281690140845 │
    │ 0.704225 ┆ 0.704225     ┆ 0.70422535   ┆ 0.7042253521  ┆ 0.704225352112 │
    └──────────┴──────────────┴──────────────┴───────────────┴────────────────┘
    

    In this case the mul doubles the scale (which I think is a bug) and then the division seems to add 4 to the scale which I can't figure out (maybe that's a bug too or just an intended hard coded default).

    Second alternative

    Use the Decimal form of 10/71 specified to the scale you want such as

    df.with_columns(
        pl.all().mul(Decimal("0.1408"))
    )
    shape: (2, 5)
    ┌──────────────┬──────────────┬──────────────┬──────────────┬──────────────┐
    │ a            ┆ b            ┆ c            ┆ d            ┆ e            │
    │ ---          ┆ ---          ┆ ---          ┆ ---          ┆ ---          │
    │ decimal[*,8] ┆ decimal[*,5] ┆ decimal[*,6] ┆ decimal[*,7] ┆ decimal[*,8] │
    ╞══════════════╪══════════════╪══════════════╪══════════════╪══════════════╡
    │ 0.28160000   ┆ 0.28160      ┆ 0.281600     ┆ 0.2816000    ┆ 0.28160000   │
    │ 0.70400000   ┆ 0.70400      ┆ 0.704000     ┆ 0.7040000    ┆ 0.70400000   │
    └──────────────┴──────────────┴──────────────┴──────────────┴──────────────┘
    

    In this case (ignoring column a) it looks like it's adding the scales. That is the same behavior as Decimal("2.00")*Decimal("0.1408") so I think that's right.