
Hypothesis, using "one_of" with Pandas dtypes in the "data_frames" strategy


I would like to construct a Pandas series that is any of several dtypes.

I was hoping to do something like this:

from hypothesis import given
import hypothesis.strategies as hs
import hypothesis.extra.numpy as hs_np
import hypothesis.extra.pandas as hs_pd
import numpy as np
import pandas as pd
import pandera as pda
import pytest

data_schema = pda.DataFrameSchema(...)

def dtype_not_float64() -> hs.SearchStrategy[np.dtype]:
    return hs.one_of(
        hs_np.integer_dtypes(),
        hs_np.complex_number_dtypes(),
        hs_np.datetime64_dtypes(),
        hs_np.timedelta64_dtypes(),
    )

@given(
    hs_pd.data_frames([
        hs_pd.column("x", dtype=dtype_not_float64()),
        hs_pd.column("y", dtype=dtype_not_float64()),
        hs_pd.column("z", dtype=dtype_not_float64()),
    ])
)
def test_invalid(df: pd.DataFrame) -> None:
    r"""Test that the schema does not pass invalid data."""
    with pytest.raises(pda.errors.SchemaError):
        _ = data_schema(df)

Arguably this is a silly test, but I hope it serves to illustrate what I am trying to achieve.

However, I got this error:

E   hypothesis.errors.InvalidArgument: Cannot convert dtype=one_of(integer_dtypes(), complex_number_dtypes(), datetime64_dtypes(), timedelta64_dtypes()) of type OneOfStrategy to type dtype

Apparently one_of() won't work with the dtype= parameter here.
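
The failure can be reproduced without pandera or pytest. The following is a minimal sketch, not part of the original post: the helper name dtype_strategy_is_rejected is hypothetical, and it assumes that calling validate() on the strategy is enough to trigger Hypothesis's argument checks outside of a test run.

```python
import hypothesis.extra.numpy as hs_np
import hypothesis.extra.pandas as hs_pd
from hypothesis.errors import InvalidArgument


def dtype_strategy_is_rejected() -> bool:
    """Return True if column() refuses a dtype *strategy* (hypothetical helper)."""
    try:
        frames = hs_pd.data_frames(
            [hs_pd.column("x", dtype=hs_np.integer_dtypes())]
        )
        # Assumption: arguments may be checked lazily, so force the check here.
        frames.validate()
    except InvalidArgument:
        return True
    return False


print(dtype_strategy_is_rejected())
```

Any strategy passed as dtype= should hit the same check, so one_of() is not the culprit in itself.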

Is there a straightforward way to generate a column with multiple possible dtypes?


Solution

  • This code fails because the dtype= argument to column() must be an actual dtype, not a strategy that generates dtypes. Unfortunately, column objects are special placeholder objects rather than strategies, so you can't pass them to one_of() either.

    Solution: build up strategies for each series, put those in a list, and pd.concat() them into a dataframe:

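    df = hs.tuples(*[
        # flatmap() needs a strategy back, so draw a series (not a column);
        # name=name binds the loop variable now, not when the lambda runs.
        dtype_not_float64().flatmap(
            lambda dt, name=name: hs_pd.series(dtype=dt).map(lambda s: s.rename(name))
        )
        for name in ["x", "y", "z"]
    ]).map(lambda ss: pd.concat(ss, axis=1))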
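
One caveat with the concat approach: series drawn independently can differ in length, and pd.concat(..., axis=1) then aligns on the index and pads with NaN, which can change integer columns to float. Below is a hedged sketch of one way around that, drawing a single index first and sharing it across all series. The names frames_from_series and SAFE_DTYPES are hypothetical, and the small fixed dtype pool is an assumption that stands in for dtype_not_float64() to keep the example short.

```python
import hypothesis.extra.pandas as hs_pd
import hypothesis.strategies as hs
import numpy as np
import pandas as pd

# Assumption: a small fixed pool of dtypes, standing in for dtype_not_float64().
ALLOWED = [np.dtype("int64"), np.dtype("int32"), np.dtype("complex128")]
SAFE_DTYPES = hs.sampled_from(ALLOWED)


def frames_from_series(names, dtype_strategy):
    """Build a dataframe strategy from per-column series sharing one index."""

    def build(index):
        columns = [
            dtype_strategy.flatmap(
                # name=name binds the loop variable at definition time.
                lambda dt, name=name: hs_pd.series(
                    dtype=dt, index=hs.just(index)
                ).map(lambda s: s.rename(name))
            )
            for name in names
        ]
        # Same index everywhere, so concat has nothing to pad.
        return hs.tuples(*columns).map(lambda ss: pd.concat(ss, axis=1))

    return hs_pd.range_indexes(min_size=1, max_size=5).flatmap(build)


df_strategy = frames_from_series(["x", "y", "z"], SAFE_DTYPES)
```

Because every series is built on the same index, each column keeps the dtype that was drawn for it.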
    

    ...although this is fiddly enough that I'd suggest using an explicit @hs.composite function to make the logic more obvious:

    @hs.composite
    def dataframes_with_names_and_dtypes(draw, names, dtype_strategy):
        # Draw a concrete dtype per column first, then hand real dtypes to column().
        cols = [hs_pd.column(name, dtype=draw(dtype_strategy)) for name in names]
        return draw(hs_pd.data_frames(cols))
    
    df = dataframes_with_names_and_dtypes(["x", "y", "z"], dtype_not_float64())
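
To show how the composite strategy plugs into a test, here is a self-contained sketch under stated assumptions. The dtype pool SAFE_DTYPES and the test name are hypothetical; the pool replaces dtype_not_float64() only to keep the example small, and the assertion checks dtypes directly instead of calling the pandera schema, which is elided in the question.

```python
import hypothesis.extra.pandas as hs_pd
import hypothesis.strategies as hs
import numpy as np
import pandas as pd
from hypothesis import given, settings

# Assumption: a small fixed pool of non-float64 dtypes for illustration.
ALLOWED = [np.dtype("int64"), np.dtype("int32"), np.dtype("complex128")]
SAFE_DTYPES = hs.sampled_from(ALLOWED)


@hs.composite
def dataframes_with_names_and_dtypes(draw, names, dtype_strategy):
    # Each column gets its own concrete dtype, drawn before the frame is built.
    cols = [hs_pd.column(name, dtype=draw(dtype_strategy)) for name in names]
    return draw(hs_pd.data_frames(cols))


@settings(max_examples=25, deadline=None)
@given(dataframes_with_names_and_dtypes(["x", "y", "z"], SAFE_DTYPES))
def test_no_float64_columns(df: pd.DataFrame) -> None:
    """Every generated column should carry one of the drawn dtypes."""
    assert list(df.columns) == ["x", "y", "z"]
    assert all(dt in ALLOWED for dt in df.dtypes)
```

In the original test, the body would instead wrap data_schema(df) in pytest.raises(...), as in the question.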