I am considering pandera
to implement strong typing of my project uses polars
dataframes.
I am puzzled on how I can type my functions correctly.
As an example let's have:
import polars as pl
import pandera.polars as pa
from pandera.typing.polars import LazyFrame as PALazyFrame
class MyModel(pa.DataFrameModel):
a: int
class Config:
strict = True
def foo(
f: pl.LazyFrame
) -> PALazyFrame[MyModel]:
# Our input is unclean, probably coming from pl.scan_parquet on some files
# The validation is dummy here
return MyModel.validate(f.select('a'))
If I'm calling mypy
it will return the following error
error: Incompatible return value type (got "DataFrameBase[MyModel]", expected "LazyFrame[MyModel]")
Sure, I can modify my signature to specify the return Type DataFrameBase[MyModel]
, but I'll lose the precision that I'm returning a LazyFrame.
Further more LazyFrame is defined as implementing DataFrameBase in pandera
code.
How can I fix my code so that the return type LazyFrame[MyModel] works?
It's quite often an issue when underlying libraries maybe don't express types as well described as they could - fortunately there are a few ways around it:
As discussed in the comments, using typing.cast
is always an option. If an external library does not produce a specific enough type this is often what I opt for - it's a lot better than using type:ignore
, and allows you to "bridge the gap" in an otherwise well-typed codebase. E.g.
import polars as pl
import pandera.polars as pa
from pandera.typing.polars import LazyFrame as PALazyFrame
import typing
class MyModel(pa.DataFrameModel):
a: int
class Config:
strict = True
def foo(
f: pl.LazyFrame
) -> PALazyFrame[MyModel]:
# Our input is unclean, probably coming from pl.scan_parquet on some files
# The validation is dummy here
return typing.cast(PALazyFrame[MyModel],MyModel.validate(f.select('a')))
As mentioned though, there are times when the cast type has to be manually adjusted - and also this would have to be done in potentially multiple places for validate.
Just supposing we need to use this model in lots of places in the code, you may wish to push the "cast" a little further from an end user. The cast doesn't really go away, but it allows us to put it in a place that could be highly reused, and reduce the number of casts in the codebase (always a good aim!). Note that the underlying code from the original library does use a cast for this method, so we're effectively just recasting to something slightly different.
In the below example, there is a new method specifically for validating lazy frames - it operates in all the same ways as regular validate, except that it takes a LazyFrame and outputs a PALazyFrame:
import polars as pl
import pandera.polars as pa
from pandera.typing.polars import LazyFrame as PALazyFrame
from typing import Optional, Self, cast
class MyDataFrameModel(pa.DataFrameModel):
@classmethod
def validate_lazy(
cls,
check_obj: pl.LazyFrame,
head: Optional[int]=None,
tail: Optional[int]=None,
sample: Optional[int]=None,
random_state: Optional[int]=None,
lazy: bool=False,
inplace: bool=False
) -> PALazyFrame[Self]:
return cast(PALazyFrame[Self], cls.validate(
check_obj,
head,
tail,
sample,
random_state,
lazy,
inplace
))
class MyModel(MyDataFrameModel):
a: int
class Config:
strict = True
def foo(
f: pl.LazyFrame
) -> PALazyFrame[MyModel]:
# Our input is unclean, probably coming from pl.scan_parquet on some files
# The validation is dummy here
return MyModel.validate_lazy(f.select('a'))
I originally considered simply overwriting the original validate method with one that another that was more generic and allowed for this use case but I found:
a) It's difficult to express the right output type
b) Overwriting methods in an incompatible manner from the inherited model is banned in modern Python anyway.
Failing that, pretty much your only option is to request a change to the underlying library. Its possible there's a way to express this method in a more generic fashion that would allow for your use case, however keep in mind that there are some typing structures that simply cannot be expressed without the existence of "Higher Kinded Types", which currently don't exist in Python. I would suspect this may be one such use case.
Hope this helps!