For context, I work with mixed tabular data. I have complex data pipelines that I’d like to make sure work on any configuration of data.
I see the pandas add-on/extra and have some questions related to that.
1. How would I generate one-hot columns with this package? Right now I’m just creating a column of integers between (0, nclasses-1) and then one-hot encoding after, but it adds up to have to do that every time.
2. How would I generate longitudinal data with this package? Say I want a multi index and then to generate a bunch of data for that?
3. Can I control the missingness more precisely? For example, the integer strategy doesn’t allow missingness. How would that also factor into multi-categorical data? Or should I just do it myself later?
Edit to add: 4. I would also be interested in trying a mix of columns as well and not always having all columns at all times.
For example, this is what I have right now for data that mixes a continuous, a binary, and a multicategorical feature and then one-hot encodes the last of these:
import unittest

import pandas as pd
from hypothesis import given, strategies as st
from hypothesis.extra.pandas import data_frames, column


class TestTransforms(unittest.TestCase):
    @given(
        data_frames(
            columns=[
                # create continuous var
                column("ctn", dtype=float),
                # create binary var
                column("bin", elements=st.integers(0, 1)),
                # create multicategorical (numerically encoded) var
                column("mult", elements=st.integers(0, 2)),
            ]
        )
    )
    def test_hypothesis(self, df):
        # one-hot encode the multicategorical column
        df = pd.concat(
            [
                df.drop(["mult"], axis=1),
                pd.get_dummies(df["mult"], prefix="mult"),
            ],
            axis=1,
        )


if __name__ == "__main__":
    unittest.main()
Final edit: Here is the final version that works for me as I wanted it to!
import unittest
from typing import Callable

import numpy as np
import pandas as pd
from hypothesis import given, strategies as st
from hypothesis.extra.pandas import data_frames, column
def onehot_multicategorical_column(
    prefix: str,
) -> Callable[[pd.DataFrame], pd.DataFrame]:
    def integrate_onehots(df: pd.DataFrame) -> pd.DataFrame:
        if df[prefix].empty:
            return df
        dummies = pd.get_dummies(df, columns=[prefix], prefix=prefix, dummy_na=True)
        # Retain NaNs: rows whose category was missing get NaN in every one-hot column
        dummies.loc[
            dummies[f"{prefix}_nan"].astype(bool),
            dummies.columns.str.startswith(prefix),
        ] = np.nan
        return dummies.drop(f"{prefix}_nan", axis=1)

    return integrate_onehots
def unpack_tuples(nested_tuples):
    """
    We receive a List[Tuple[int, List[int]]].
    The int is the numerical id, and the list holds its "time points".
    We want to flatten this into a List[Tuple[int, int]] with the same
    id for multiple time points.
    E.g. [(0,[0,1,2]), (1,[0,2])] => [(0,0), (0,1), (0,2), (1,0), (1,2)]
    """
    return [
        (pt_id, time_pt) for pt_id, time_pts in nested_tuples for time_pt in time_pts
    ]
class TestTransforms(unittest.TestCase):
    @given(
        data_frames(
            columns=[
                column("ctn", dtype=float),
                column("bin", elements=st.one_of(st.none(), st.integers(0, 1))),
                column(
                    "mult", elements=st.one_of(st.none(), st.sampled_from([0, 1, 2]))
                ),
            ],
            index=st.builds(
                pd.MultiIndex.from_tuples,
                st.lists(
                    st.tuples(
                        st.integers(0), st.lists(st.integers(0), min_size=1, max_size=5)
                    ),
                    min_size=2,
                ).map(unpack_tuples),
            ),
        ).map(onehot_multicategorical_column("mult"))
    )
    def test_hypothesis(self, df):
        # test stuff with df
        pass


if __name__ == "__main__":
    unittest.main()
How would I generate one-hot columns with this package? Right now I’m just creating a column of integers between (0, nclasses-1) and then one hot encoding after, but it adds up to have to do that every time.
That, or something equivalent like sampled_from(column_names), is exactly how I'd do it. A helper function applied with the .map(categories_to_one_hot_columns) method should make this reasonably easy.
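As a rough sketch of that helper-plus-.map() idea (not a definitive implementation): the category labels, the column names, and the body of categories_to_one_hot_columns below are all assumptions made up for illustration. The settings values are also arbitrary.

```python
import pandas as pd
from hypothesis import given, settings, strategies as st
from hypothesis.extra.pandas import column, data_frames

CATEGORIES = ["a", "b", "c"]  # hypothetical category labels


def categories_to_one_hot_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Replace the 'mult' column with one indicator column per category."""
    # A fixed-category dtype guarantees every indicator column exists,
    # even when a category never occurs in the generated data.
    mult = df["mult"].astype(pd.CategoricalDtype(CATEGORIES))
    dummies = pd.get_dummies(mult, prefix="mult", dtype=int)
    return pd.concat([df.drop(columns="mult"), dummies], axis=1)


# Generate the label column first, then one-hot encode it via .map()
one_hot_frames = data_frames(
    columns=[
        column("ctn", dtype=float),
        column("mult", elements=st.sampled_from(CATEGORIES)),
    ]
).map(categories_to_one_hot_columns)


@settings(deadline=None, max_examples=25)
@given(one_hot_frames)
def test_one_hot_columns(df):
    indicator_cols = [f"mult_{c}" for c in CATEGORIES]
    assert list(df.columns) == ["ctn"] + indicator_cols
    # Every row has exactly one active indicator
    assert (df[indicator_cols].sum(axis=1) == 1).all()
```

Because the encoding happens inside the strategy, every test that takes one_hot_frames receives already-encoded data, so the encoding step does not have to be repeated in each test body.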
How would I generate longitudinal data with this package? Say I want a multi index and then to generate a bunch of data for that?
The pdst.series() and pdst.data_frames() strategies both accept an index= argument, which you could define as e.g.
    index = st.builds(
        pd.MultiIndex.from_tuples,
        st.lists(st.tuples(...), min_size=1, max_size=10)
    )
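To make that concrete, here is one possible way to fill in the tuples, as a sketch only. The level names (subject_id, time_point), the value ranges, and the use of unique=True are my assumptions for illustration and are not required by the index= argument.

```python
import pandas as pd
from hypothesis import given, settings, strategies as st
from hypothesis.extra.pandas import column, data_frames

# Each index entry is a (subject_id, time_point) pair; the level names
# and value ranges are illustrative assumptions.
index_pairs = st.lists(
    st.tuples(st.integers(0, 20), st.integers(0, 4)),
    min_size=1,
    max_size=10,
    unique=True,  # avoid duplicate (subject_id, time_point) rows
)

longitudinal_index = st.builds(
    pd.MultiIndex.from_tuples,
    index_pairs,
    names=st.just(["subject_id", "time_point"]),
)

longitudinal_frames = data_frames(
    columns=[column("ctn", dtype=float)],
    index=longitudinal_index,
)


@settings(deadline=None, max_examples=25)
@given(longitudinal_frames)
def test_longitudinal_index(df):
    assert isinstance(df.index, pd.MultiIndex)
    assert df.index.nlevels == 2
    assert df.index.is_unique
    assert 1 <= len(df) <= 10
```

Drawing flat pairs like this does not guarantee several time points per subject; if you need that, generate each id together with a list of time points and flatten the result before building the index.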
Can I control the missingness more precisely? For example, integer strategy doesn’t allow missingness. How would that also factor into multi-categorical data? Or should I just do it myself later.
I'd use st.none() | st.integers() for missingness; more generally, st.one_of(...) can be used to mix strategies together.
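A minimal sketch of both spellings applied to data frame columns, assuming the same "bin" and "mult" columns as in the question; the dtype is left for inference as in the question's code, and the settings values are arbitrary.

```python
import pandas as pd
from hypothesis import given, settings, strategies as st
from hypothesis.extra.pandas import column, data_frames

# Either branch may be drawn for each cell, so missing values
# show up at random positions in the column.
maybe_binary = st.none() | st.integers(0, 1)
maybe_category = st.one_of(st.none(), st.sampled_from([0, 1, 2]))

frames_with_missing = data_frames(
    columns=[
        column("bin", elements=maybe_binary),
        column("mult", elements=maybe_category),
    ]
)


@settings(deadline=None, max_examples=25)
@given(frames_with_missing)
def test_missing_values(df):
    # After dropping missing cells, only the allowed codes remain
    assert df["bin"].dropna().isin([0, 1]).all()
    assert df["mult"].dropna().isin([0, 1, 2]).all()
```

This gives no direct control over the proportion of missing cells; if that matters, applying the missingness yourself afterwards may still be the simpler route.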