I need a DataFrame with r rows and a dynamic number of columns (one column per group). The input count column specifies how many True values are expected in the new DataFrame for each group. My current implementation creates a temporary DataFrame with a single row containing a True value for each group in df, then explode()'s that temporary DataFrame. Finally, it groups by count and aggregates into the result df.
The input df looks like this:
--
| group | count | ... |
|---|---|---|
| A | 2 | |
| B | 0 | |
| C | 4 | |
| D | 1 | |
And I need to fill a new DataFrame with these values placed randomly (the number of columns is dynamic, and so are their names):
--
A | B | C | D |
---|---|---|---|
NaN | NaN | True | True |
True | NaN | True | NaN |
NaN | NaN | NaN | NaN |
NaN | NaN | True | NaN |
True | NaN | True | NaN |
I think it's possible to add a randomized set of positions from 1 to r and, after expanding and so on, just agg(sum) by those values (a rough sketch of what I mean follows after my code below).
--
import pandas as pd

inputs = [
    {"group": "A", "count": 2},
    {"group": "B", "count": 0},
    {"group": "C", "count": 4},
    {"group": "D", "count": 1},
]
df = pd.DataFrame(inputs)
def expand(count: int, group: str) -> pd.DataFrame:
    """Expand the group into `count` rows of True."""
    count = int(round(count))
    df1 = pd.DataFrame([{group: True}])
    # I'm thinking here I need to add a random seed
    df1 = df1.assign(count=[list(range(1, count + 1))])\
             .explode('count')\
             .reset_index(drop=True)
    return df1
def creator(df: pd.DataFrame) -> pd.DataFrame:
    """Create the expanded frame for every count value of a group."""
    dfs = [expand(r, df['group'].values[0]) for r in df['count'].values]
    return pd.concat(dfs, ignore_index=True)
# ... and then group by my "seed" (level 1 of the resulting index)
df.groupby('group', as_index=False)\
  .apply(creator)\
  .drop('count', axis=1)\
  .groupby(level=1)\
  .agg('sum')
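To make that idea concrete, here is a rough, untested sketch of what I mean (expand_random, rng and R are names I just made up for this example; df is the input frame from above): give every True a random row label instead of a sequential one, then aggregate by that label.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # seed picked only to make the example reproducible
R = 5                            # desired number of rows

def expand_random(group: str, count: int) -> pd.DataFrame:
    """Like expand(), but each True gets a random row position as its index."""
    rows = rng.choice(R, int(count), replace=False)   # distinct random row labels
    return pd.DataFrame({group: True}, index=rows)

out = (
    pd.concat(expand_random(g, c) for g, c in zip(df["group"], df["count"]))
      .groupby(level=0)                                # group by the random row label ("my seed")
      .first()                                         # at most one value per (row, group); the rest stay NaN
      .reindex(index=range(R), columns=list(df["group"]))  # make every row and group column present
)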
To state my questions explicitly, in case it helps:

1. Should I add the randomization (the random seed) inside the expand() function?
2. Or should I create a DataFrame filled with NaN and then just drop my values into it randomly (with pd.where or something similar)?

PS: This is my first time asking a question, so I hope I have provided enough information!
A pure pandas solution would be to use sample:
import numpy as np
R = 5  # number of output rows, as in the old answer below
out = pd.DataFrame(
    {g: [True] * c + [np.nan] * (R - c) for g, c in df.to_numpy()}
).sample(frac=1)
Output:
print(out)
A B C D
0 True NaN True NaN
1 NaN NaN True NaN
2 True NaN NaN NaN
3 NaN NaN True True
4 NaN NaN True NaN
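If you need the shuffle to be reproducible, note that sample also accepts a random_state argument, and ignore_index=True (pandas 1.3+) renumbers the shuffled rows from 0; the seed value 0 below is just an example:

out = pd.DataFrame(
    {g: [True] * c + [np.nan] * (R - c) for g, c in df.to_numpy()}
).sample(frac=1, random_state=0, ignore_index=True)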
Old answer:
A simple approach would be to start from a DataFrame pre-filled with NaN and then, for each group, randomly pick the [index, column] coordinates to set to True:
np.random.seed(0)

R = 5  # <-- rows
out = pd.DataFrame(
    np.nan, index=range(R), columns=list(df["group"])
)

for g, c in df.to_numpy():
    # pick `c` distinct rows for this group and mark them True
    out.loc[np.random.choice(out.index, c, replace=False), g] = True
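As a quick sanity check (an addition for illustration, not part of the original snippet), counting the non-NaN cells per column should give back exactly the requested counts, since replace=False draws c distinct rows per group:

# every column should contain exactly `count` True values
print(out.notna().sum())
# expected counts per column: A -> 2, B -> 0, C -> 4, D -> 1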