I am producing a pd.DataFrame using the hypothesis library like so:
import datetime
from hypothesis import strategies as st
from hypothesis.extra.pandas import columns as cols
from hypothesis.extra.pandas import data_frames, indexes

data_frames(
    columns=cols(
        ["sec1", "sec2", "sec3"], elements=st.floats(allow_infinity=False)
    ),
    index=indexes(
        elements=st.dates(
            min_value=datetime.date(2023, 10, 31),
            max_value=datetime.date(2024, 5, 31),
        )
    ),
).example()
                      sec1  sec2           sec3
2024-01-05   -3.333333e-01   NaN            NaN
2024-05-20   -9.007199e+15   NaN  -2.000010e+00
2024-02-28   -1.175494e-38   NaN   1.500000e+00
2024-01-24   -1.100000e+00   NaN   1.100000e+00
2023-11-19   -1.175494e-38   NaN  -2.000010e+00
2024-05-28   -1.000000e-05   NaN   2.541486e+16
2024-01-31  -1.797693e+308   NaN            NaN
2024-05-03   4.940656e-324   NaN  -6.647158e+16
I need to make sure that no individual column consists exclusively of NaN values.
I also want to avoid creating an empty pd.DataFrame.
I think you can call filter on the data_frames strategy to reject badly sampled datasets, i.e. frames in which some column contains only NaN values.
Use something like this:
data = data_frames(...)
data = data.filter(lambda tt: tt.isna().sum(axis=0).max() < tt.shape[0])
Or, if you want to apply the filter to only a subset of the columns:
data = data_frames(...)
col_names = ['sec2', 'sec3']
data = data.filter(lambda tt: tt[col_names].isna().sum(axis=0).max() < tt.shape[0])
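To see what this predicate accepts and rejects, it can be tried on a few hand-built frames without involving hypothesis at all. The sketch below is only an illustration: the helper name has_no_all_nan_column and its columns parameter are made up for this example, and the three small frames are invented test inputs.

```python
import numpy as np
import pandas as pd


def has_no_all_nan_column(df, columns=None):
    """Return True if no checked column consists only of NaN values.

    Hypothetical helper: it wraps the same expression used in the
    lambda above. `columns` optionally restricts the check to a subset.
    """
    subset = df if columns is None else df[columns]
    # The largest per-column NaN count must stay below the row count.
    return bool(subset.isna().sum(axis=0).max() < subset.shape[0])


good = pd.DataFrame({"sec1": [1.0, np.nan], "sec2": [np.nan, 2.0]})
bad = pd.DataFrame({"sec1": [1.0, 2.0], "sec2": [np.nan, np.nan]})
empty = pd.DataFrame({"sec1": [], "sec2": []}, dtype=float)

print(has_no_all_nan_column(good))                    # True
print(has_no_all_nan_column(bad))                     # False: sec2 is all NaN
print(has_no_all_nan_column(empty))                   # False: 0 < 0 does not hold
print(has_no_all_nan_column(bad, columns=["sec1"]))   # True: sec2 is not checked
```

Note that the predicate also rejects a frame with zero rows, because the comparison 0 < 0 is false.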
One caveat that I've found: you need to check all columns together in a single .filter call. Otherwise you may still see columns that contain only NaNs. For example, these instructions won't produce a correct dataset:
# Doesn't work - may still give columns with all NaNs!
data = data_frames(...)
for col in col_names:
    data = data.filter(lambda tt: tt[col].isna().sum() < tt.shape[0])
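One likely explanation for this behaviour, which the answer itself does not state, is Python's late binding of closures: each lambda looks up col when it is called, not when it is created, so after the loop every filter ends up checking only the last column. The sketch below shows the effect in plain Python, with a dict standing in for a generated frame; the names row, late_bound and early_bound are made up for the illustration.

```python
col_names = ["sec1", "sec2", "sec3"]

# Stand-in for one generated frame: None marks an "all NaN" column.
row = {"sec1": None, "sec2": None, "sec3": 1.0}

late_bound = []
for col in col_names:
    # `col` is resolved when the lambda runs, i.e. after the loop has ended.
    late_bound.append(lambda r: r[col] is not None)

early_bound = []
for col in col_names:
    # A default argument captures the value `col` has right now.
    early_bound.append(lambda r, col=col: r[col] is not None)

late_results = [check(row) for check in late_bound]
early_results = [check(row) for check in early_bound]

print(late_results)   # [True, True, True]: every check looked at "sec3"
print(early_results)  # [False, False, True]: each check looked at its own column
```

If that is indeed the cause, binding the column with a default argument (lambda tt, col=col: ...) should make the loop version behave as intended, although the single combined filter remains the simpler option.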
Result
So the final code is:
import datetime
from hypothesis import strategies as st
from hypothesis.extra.pandas import columns as cols
from hypothesis.extra.pandas import data_frames, indexes

aa = data_frames(
    columns=cols(
        ["sec1", "sec2", "sec3"],
        elements=st.floats(allow_infinity=False),
    ),
    index=indexes(
        elements=st.dates(
            min_value=datetime.date(2023, 10, 31),
            max_value=datetime.date(2024, 5, 31),
        ),
        min_size=1,  # generate at least 1 row, prevents generating an empty dataframe
    ),
)
# reject any frame in which some column consists only of NaN values
aa = aa.filter(lambda tt: tt.isna().sum(axis=0).max() < tt.shape[0])
print(aa.example())