Tags: python, pandas, dataframe, python-hypothesis, pandera

Pandera: slow synthetic DataFrame generation when the schema uses checks other than eq


I am trying to learn the Pandera library. When I try to automatically generate test data from a DataFrameModel, generation suddenly becomes very slow or crashes as soon as I deviate from the minimal example in terms of the checks on value limits.

Consider the base example from the Pandera documentation: https://pandera.readthedocs.io/en/latest/data_synthesis_strategies.html#strategies-and-examples-from-dataframe-models

I can extend it with several additional columns. Consider the code:

from pandera.typing import Series, DataFrame
import pandera as pa
import hypothesis


class InSchema(pa.DataFrameModel):
    column1: Series[int] = pa.Field(eq=10)
    column2: Series[float] = pa.Field(eq=0.25)
    column3: Series[str] = pa.Field(eq="foo")
    column4: Series[int] = pa.Field()
    column5: Series[int] = pa.Field(eq=123)
    column6: Series[int] = pa.Field(eq=123)
    column7: Series[int] = pa.Field(eq=123)


class OutSchema(InSchema):
    column4: Series[float]


@pa.check_types
def processing_fn(df: DataFrame[InSchema]) -> DataFrame[OutSchema]:
    return df.assign(column4=df.column1 * df.column2)



def test_processing_fn():
    dataframe = InSchema.example(size=5)
    processing_fn(dataframe)
    print(dataframe)


if __name__ == "__main__":
    gg = InSchema.strategy(size=5)
    test_processing_fn()
    print("Done!")

All right. Now observe what happens if I change one column's check to unique=True:

from pandera.typing import Series, DataFrame
import pandera as pa
import hypothesis


class InSchema(pa.DataFrameModel):
    # NOW UNIQUE
    column1: Series[int] = pa.Field(unique=True)
    column2: Series[float] = pa.Field(eq=0.25)
    column3: Series[str] = pa.Field(eq="foo")
    column4: Series[int] = pa.Field()
    column5: Series[int] = pa.Field(eq=123)
    column6: Series[int] = pa.Field(eq=123)
    column7: Series[int] = pa.Field(eq=123)


class OutSchema(InSchema):
    column4: Series[float]


@pa.check_types
def processing_fn(df: DataFrame[InSchema]) -> DataFrame[OutSchema]:
    return df.assign(column4=df.column1 * df.column2)



def test_processing_fn():
    dataframe = InSchema.example(size=5)
    processing_fn(dataframe)
    print(dataframe)


if __name__ == "__main__":
    gg = InSchema.strategy(size=5)
    test_processing_fn()
    print("Done!")

Pandera is still able to generate a DataFrame. But if I now replace one check with an equivalent pair of checks (eq=123 <=> le=123 && ge=123), it fails:

from pandera.typing import Series, DataFrame
import pandera as pa
import hypothesis


class InSchema(pa.DataFrameModel):
    column1: Series[int] = pa.Field(unique=True)
    column2: Series[float] = pa.Field(eq=0.25)
    column3: Series[str] = pa.Field(eq="foo")
    column4: Series[int] = pa.Field()
    # EQUIVALENT CHECKS
    column5: Series[int] = pa.Field(le=123, ge=123)
    column6: Series[int] = pa.Field(eq=123)
    column7: Series[int] = pa.Field(eq=123)


class OutSchema(InSchema):
    column4: Series[float]


@pa.check_types
def processing_fn(df: DataFrame[InSchema]) -> DataFrame[OutSchema]:
    return df.assign(column4=df.column1 * df.column2)



def test_processing_fn():
    dataframe = InSchema.example(size=5)
    processing_fn(dataframe)
    print(dataframe)


if __name__ == "__main__":
    gg = InSchema.strategy(size=5)
    test_processing_fn()
    print("Done!")

Now I get an error:

hypothesis.errors.Unsatisfiable: Unable to satisfy assumptions of example_generating_inner_function

My question is: why? If I use the two restrictions (unique and ge/le) separately, it works, but when they are combined the schema cannot be satisfied?


Solution

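  • This is because Pandera has a fairly inefficient approach: it generates a custom strategy for the first part of the schema, and then uses rejection sampling to filter out values which fail any subsequent check. In your case one of the two bounds on column5 becomes the base strategy and the other is applied as a filter, so almost every generated integer is rejected and Hypothesis eventually gives up with Unsatisfiable. That's why the Pandera docs recommend specifying the most restrictive constraint as the base strategy.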

    As a Hypothesis maintainer I've been sad about this for several years, opened this issue three years ago, built some runtime optimizations in Hypothesis, and after reading this have opened a PR so that hopefully Pandera users will actually get the performance features I made for them.


    Update: the optimizations have been released in Pandera 0.18.1, on 2024-03-10. On older versions, or if you still have performance issues, you might want to use our pandas strategies directly :-)
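
As a rough illustration of what "using the pandas strategies directly" could look like for the schema in the question, here is a minimal sketch built on hypothesis.extra.pandas. It assumes hypothesis, pandas and numpy are installed. The names N_ROWS, frames, check_frame and test_generated_frames are made up for this example and are not part of Pandera or Hypothesis. Every column is built by construction (for instance, column5 draws from integers bounded to 123 on both sides), so nothing has to be filtered out afterwards.

```python
import pandas as pd
from hypothesis import given, settings, strategies as st
from hypothesis.extra.pandas import column, data_frames, range_indexes

N_ROWS = 5  # hypothetical fixed size, mirroring size=5 in the question

# Each column gets its own element strategy, so values satisfy the
# constraints when they are drawn instead of being rejected later.
frames = data_frames(
    columns=[
        column("column1", dtype="int64", unique=True),
        column("column2", elements=st.just(0.25), dtype="float64"),
        column("column3", elements=st.just("foo"), dtype=object),
        column("column4", dtype="int64"),
        # Equivalent of le=123 and ge=123, expressed as bounds.
        column(
            "column5",
            elements=st.integers(min_value=123, max_value=123),
            dtype="int64",
        ),
        column("column6", elements=st.just(123), dtype="int64"),
        column("column7", elements=st.just(123), dtype="int64"),
    ],
    index=range_indexes(min_size=N_ROWS, max_size=N_ROWS),
)


def check_frame(df: pd.DataFrame) -> None:
    """Assert the same constraints the InSchema model expresses."""
    assert len(df) == N_ROWS
    assert df["column1"].is_unique
    assert (df["column2"] == 0.25).all()
    assert (df["column3"] == "foo").all()
    assert (df["column5"] == 123).all()
    assert (df["column6"] == 123).all()
    assert (df["column7"] == 123).all()


@settings(max_examples=25, deadline=None)
@given(frames)
def test_generated_frames(df: pd.DataFrame) -> None:
    check_frame(df)


if __name__ == "__main__":
    test_generated_frames()
    print("Done!")
```

The generated frames can still be passed through InSchema.validate or a function decorated with pa.check_types if you want Pandera to do the checking; the sketch only replaces the generation step.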