python pandas python-hypothesis

Python hypothesis dataframe: ensure no column consists exclusively of NaN values


I am producing a pd.DataFrame using the hypothesis library like so:

import datetime
from hypothesis import strategies as st
from hypothesis.extra.pandas import columns as cols
from hypothesis.extra.pandas import data_frames, indexes

data_frames(
    columns=cols(
        ["sec1", "sec2", "sec3"], elements=st.floats(allow_infinity=False)
    ),
    index=indexes(elements=st.dates(
        min_value=datetime.date(2023,10,31), 
        max_value=datetime.date(2024,5,31))
    ),
).example()

                     sec1 sec2          sec3
2024-01-05  -3.333333e-01  NaN           NaN
2024-05-20  -9.007199e+15  NaN -2.000010e+00
2024-02-28  -1.175494e-38  NaN  1.500000e+00
2024-01-24  -1.100000e+00  NaN  1.100000e+00
2023-11-19  -1.175494e-38  NaN -2.000010e+00
2024-05-28  -1.000000e-05  NaN  2.541486e+16
2024-01-31 -1.797693e+308  NaN           NaN
2024-05-03  4.940656e-324  NaN -6.647158e+16

I need to make sure that no individual column consists exclusively of NaN values.

I also want to avoid generating an empty pd.DataFrame.


Solution

  • You can call .filter on the data_frames strategy to reject generated DataFrames in which some column is entirely NaN.

    Use something like this:

    data = data_frames(...)
    data = data.filter(lambda tt: tt.isna().sum(axis=0).max() < tt.shape[0])
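
To see what the predicate accepts and rejects, it can be tried on hand-built frames without Hypothesis. This is a minimal sketch: the helper name no_all_nan_column and the three sample frames are made up for illustration. Note that the same comparison also rejects a frame with zero rows, since 0 < 0 is false.

```python
import numpy as np
import pandas as pd


def no_all_nan_column(df: pd.DataFrame) -> bool:
    """Return True when every column holds at least one non-NaN value."""
    # isna().sum(axis=0) counts the NaNs in each column; a column is
    # entirely NaN exactly when that count equals the number of rows.
    return bool(df.isna().sum(axis=0).max() < df.shape[0])


# Hypothetical sample frames, not taken from the question.
good = pd.DataFrame({"sec1": [1.0, np.nan], "sec2": [np.nan, 2.0]})
bad = pd.DataFrame({"sec1": [1.0, 2.0], "sec2": [np.nan, np.nan]})
empty = pd.DataFrame({"sec1": [], "sec2": []}, dtype=float)

print(no_all_nan_column(good))   # every column has a real value
print(no_all_nan_column(bad))    # sec2 is entirely NaN
print(no_all_nan_column(empty))  # zero rows
```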
    

    Or, to apply the filter to only a subset of columns:

    data = data_frames(...)
    
    # restrict the check to these columns
    col_names = ['sec2','sec3']
    data = data.filter(lambda tt: tt[col_names].isna().sum(axis=0).max() < tt.shape[0])
    

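    One caveat: check all columns together in a single .filter call. Adding one .filter per column in a loop, as below, can still yield columns that are entirely NaN. The likely cause is Python's late binding of closures: each lambda looks up col when it is called, not when it is defined, so after the loop every filter tests only the last column. For example, these instructions do not produce a correct dataset: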

    # Doesn't work: may still give columns that are entirely NaN!
    data = data_frames(...)
    for col in col_names:
        data = data.filter(lambda tt: tt[col].isna().sum() < tt.shape[0])
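
The loop above most likely fails because a lambda reads the loop variable col at call time, so every filter ends up inspecting the last column. The sketch below reproduces this with plain pandas and shows one common workaround, binding col as a default argument. The sample frame and the names late_checks and bound_checks are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: sec1 is entirely NaN, sec2 is fine.
df = pd.DataFrame({"sec1": [np.nan, np.nan], "sec2": [1.0, 2.0]})
col_names = ["sec1", "sec2"]

# Late binding: each lambda looks up `col` only when it is called,
# so after the loop both predicates test the last column ("sec2").
late_checks = []
for col in col_names:
    late_checks.append(lambda tt: bool(tt[col].isna().sum() < tt.shape[0]))

# Early binding: the default argument freezes the current value of `col`.
bound_checks = []
for col in col_names:
    bound_checks.append(
        lambda tt, col=col: bool(tt[col].isna().sum() < tt.shape[0])
    )

late_result = all(check(df) for check in late_checks)
bound_result = all(check(df) for check in bound_checks)

print(late_result)   # the all-NaN sec1 column goes unnoticed
print(bound_result)  # sec1 is caught
```

Under the same assumption, writing the loop body as data = data.filter(lambda tt, col=col: ...) should make the per-column filters behave as intended, although a single combined .filter remains simpler.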
    

    Result

    So the final code is:

    import datetime
    from hypothesis import strategies as st
    from hypothesis.extra.pandas import columns as cols
    from hypothesis.extra.pandas import data_frames, indexes
    
    
    aa = data_frames(
        columns=cols(
            ["sec1", "sec2", "sec3"], 
            elements=st.floats(allow_infinity=False)
        ),
        index=indexes(
            elements=st.dates(
                min_value=datetime.date(2023,10,31), 
                max_value=datetime.date(2024,5,31),
            ),
            min_size=1,  # generate at least 1 row; prevents an empty dataframe
        ),
    )
    
    
    # reject any dataframe in which some column is entirely NaN
    aa = aa.filter(lambda tt: tt.isna().sum(axis=0).max() < tt.shape[0])
    
    print(aa.example())
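
In a test suite, the filtered strategy would normally be passed to @given rather than sampled with .example(). The following is a sketch under a few assumptions: the test name and the settings values (25 examples, no deadline, two health checks suppressed as a precaution in case many frames are filtered out) are illustrative choices, not part of the original answer.

```python
import datetime

from hypothesis import HealthCheck, given, settings
from hypothesis import strategies as st
from hypothesis.extra.pandas import columns as cols
from hypothesis.extra.pandas import data_frames, indexes

frames = data_frames(
    columns=cols(
        ["sec1", "sec2", "sec3"],
        elements=st.floats(allow_infinity=False),
    ),
    index=indexes(
        elements=st.dates(
            min_value=datetime.date(2023, 10, 31),
            max_value=datetime.date(2024, 5, 31),
        ),
        min_size=1,  # at least one row
    ),
).filter(lambda tt: tt.isna().sum(axis=0).max() < tt.shape[0])


@settings(
    max_examples=25,  # illustrative value
    deadline=None,    # DataFrame generation can be slow on the first run
    suppress_health_check=[HealthCheck.filter_too_much, HealthCheck.too_slow],
)
@given(frames)
def test_no_column_is_all_nan(df):
    # The frame is never empty ...
    assert len(df) >= 1
    # ... and every column contains at least one non-NaN value.
    assert df.notna().any(axis=0).all()


# Calling the decorated function directly runs the property test.
test_no_column_is_all_nan()
```

If the filter rejects too many candidates in practice, it may be worth constraining the element strategy instead, for example for columns that must never contain NaN at all.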