python, pandas, python-hypothesis

Generate a pandas DataFrame with the Python Hypothesis library where one column is dependent on another


I'm trying to use Hypothesis to generate pandas DataFrames where some column values are dependent on other column values. So far, I haven't been able to 'link' two columns.

This code snippet:

from hypothesis import strategies as st
from hypothesis.extra.pandas import data_frames, column, range_indexes

def create_dataframe():
    id1 = st.integers().map(lambda x: x)
    id2 = st.shared(id1).map(lambda x: x * 2)
    df = data_frames(index = range_indexes(min_size=10, max_size=100), columns=[
        column(name='id1',  elements=id1, unique=True),
        column(name='id2', elements=id2),
    ])
    return df

Produces a dataframe with a static second column:

            id1  program_id
0   1.170000e+02       110.0
1   3.600000e+01       110.0
2   2.876100e+04       110.0
3  -1.157600e+04       110.0
4   5.300000e+01       110.0
5   2.782100e+04       110.0
6   1.334500e+04       110.0
7  -3.100000e+01       110.0

Solution

  • I think you're after the rows argument, which lets you generate each row as a unit, so that some column values can be computed from others. For example, if we wanted a price and a sale_price column, where the sale price has some discount applied:

    from hypothesis import strategies as st
    from hypothesis.extra.pandas import data_frames, range_indexes
    
    def create_dataframe():
        full = st.floats(1, 1000)  # all items cost $1 to $1,000
        discounts = st.sampled_from([0, 0.1, 0.25, 0.5])
        rows = st.tuples(full, discounts).map(
            lambda xs: dict(price=xs[0], sale_price=xs[0] * (1-xs[1]))
        )
        return data_frames(
            index = range_indexes(min_size=10, max_size=100),
            rows = rows
        )
    
             price  sale_price
    0   757.264509  378.632254
    1   824.384095  618.288071
    2   401.187339  300.890504
    3   723.193610  650.874249
    4   777.171038  699.453934
    5   274.321034  205.740776
    

    So what went wrong with your example code? It looks like you expected the id1 and id2 strategies to be linked on a row-wise basis, but each column draws its elements independently. On top of that, shared() draws a single value per generated example and reuses it every time the shared strategy is drawn from, so every row of id2 ends up with the same value.
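
Applied to the id1/id2 columns from the question, the same idea might look like the sketch below. It assumes id2 should always be twice id1, bounds the integers to -1000..1000 purely for illustration, and drops the unique=True constraint from the question. The name create_linked_dataframe is made up for this example.

```python
from hypothesis import strategies as st
from hypothesis.extra.pandas import data_frames, range_indexes


def create_linked_dataframe():
    # Hypothetical helper: draw one integer per row and derive both
    # columns from it, so id2 is tied to id1 within the same row.
    # The bounds are an illustrative assumption, not part of the question.
    rows = st.integers(min_value=-1000, max_value=1000).map(
        lambda x: {"id1": x, "id2": x * 2}
    )
    return data_frames(
        index=range_indexes(min_size=1, max_size=10),
        rows=rows,
    )
```

Used inside an @given test, every generated frame should satisfy df["id2"] == df["id1"] * 2 on each row, because both values come from the same draw.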