
Randomly fill a fixed-size pandas DataFrame with values based on a count column (the reverse of .count())

I need a DataFrame with r rows and a dynamic number of columns (one per group). The input count column specifies how many True values are expected in each column of the new DataFrame. My current implementation creates a temporary DataFrame with a single row containing a True value for each group in df, and then explodes that temporary DataFrame. Finally, it groups by count and aggregates into the result df.



| group | count | ... 
|   A   |   2   |     
|   B   |   0   |     
|   C   |   4   |     
|   D   |   1   |     

I need to fill a new DataFrame with these values randomly (the number of columns c is dynamic, as are the column names).

Expected output:


NaN NaN True True
True NaN True NaN
NaN NaN True NaN
True NaN True NaN

I think it should be possible to add a randomized set of values from 1 to r and, after expanding, just aggregate (agg(sum)) by these values.

My code:


import pandas as pd

inputs = [
    {"group": "A", "count": 2},
    {"group": "B", "count": 0},
    {"group": "C", "count": 4},
    {"group": "D", "count": 1},
]
df = pd.DataFrame(inputs)

def expand(count:int, group: str) -> pd.DataFrame:
    """expands DF by counts"""
    count = int(round(count))
    df1 = pd.DataFrame([{group: True}])
    # I think I need to add a random seed here
    df1 = df1.assign(count=[list(range(1, count + 1))])
    return df1

def creator(df: pd.DataFrame) -> pd.DataFrame:
    """create new DF for every group value(count)"""
    dfs = [expand(r, df['group'].values[0]) for r in list(df['count'].values)]
    df = pd.concat(dfs, ignore_index=True)
    return df
(
    df.groupby('group', as_index=False)
    .apply(creator)  # assumed missing step: run creator once per group
    .drop('count', axis=1)
    # and groupby my seed
)

To make my questions explicit, in case that helps:

  1. Is there any method in pandas that makes this easier or better?
  2. How can I generate random counts and assign them in the expand() function?
  3. Is there a way to create a DataFrame of a given size filled with NaN and then just drop my values into it randomly (like pd.where or something)?

PS: This is my first time asking a question, so I hope I have provided enough information!


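  • A pure pandas solution would be to use sample (with numpy imported as np) :

    R = 5 # <-- rows
    out = pd.DataFrame(
        {g: [True]*c + [np.nan]*(R-c) for g, c in df.to_numpy()}
    ).apply(lambda s: s.sample(frac=1, ignore_index=True)) # shuffle each column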

    Output :

          A   B     C     D
    0  True NaN  True   NaN
    1   NaN NaN  True   NaN
    2  True NaN   NaN   NaN
    3   NaN NaN  True  True
    4   NaN NaN  True   NaN
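
Since the goal is the reverse of .count(), one quick sanity check is that counting the non-null values per column gives back the input counts. Below is a minimal, self-contained sketch of that check. The `RandomState` object `rs` and the name `recovered` are assumptions added here, only to make the shuffle repeatable and the check readable.

```python
import numpy as np
import pandas as pd

# Input from the question: how many True values each column should hold.
df = pd.DataFrame({"group": ["A", "B", "C", "D"], "count": [2, 0, 4, 1]})
R = 5  # number of rows, as in the output above

# One shared random state (an assumption, not part of the answer): every call
# to sample() advances it, so each column gets its own shuffle while the whole
# run stays repeatable.
rs = np.random.RandomState(0)

# Each column starts as `count` True values followed by NaN padding,
# then is shuffled on its own.
out = pd.DataFrame(
    {g: [True] * c + [np.nan] * (R - c) for g, c in df.to_numpy()}
).apply(lambda s: s.sample(frac=1, ignore_index=True, random_state=rs))

# .count() skips NaN, so it should return the counts we started from.
recovered = out.count()
print(out)
print(recovered)
```

The check only works if no count is larger than R; otherwise `R - c` is negative and the column ends up longer than R.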

    Old answer :

    A simple approach would be to start from an all-NaN DataFrame and then, for each column, randomly pick the row indices to set to True :

    import numpy as np

    R = 5 # <-- rows
    out = pd.DataFrame(
        np.nan, index=range(R), columns=list(df["group"]), dtype=object
    ) # object dtype lets a column hold both NaN and True
    for g, c in df.to_numpy():
        out.loc[np.random.choice(out.index, c, replace=False), g] = True
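
The same "pick random rows per column" idea can also be written without a Python loop, and with an explicit seed (the question mentions wanting one). This is only a sketch of an alternative, not the method from the answer: `fill_randomly`, `n_rows` and `seed` are hypothetical names, and it assumes the input has the `group` and `count` columns from the question.

```python
import numpy as np
import pandas as pd


def fill_randomly(counts: pd.DataFrame, n_rows: int, seed=None) -> pd.DataFrame:
    """Return an n_rows-tall frame with `count` True values in each group column.

    Hypothetical helper for illustration; expects columns "group" and "count".
    """
    wanted = counts["count"].to_numpy()
    if (wanted > n_rows).any():
        raise ValueError("a count is larger than the number of rows")

    rng = np.random.default_rng(seed)
    # argsort of random numbers gives an independent random permutation of
    # 0..n_rows-1 in every column.
    perm = rng.random((n_rows, len(counts))).argsort(axis=0)
    # In each column exactly `count` entries of the permutation are < count,
    # so this marks `count` randomly placed cells per column.
    mask = perm < wanted

    # Object array so that True and NaN can share a column.
    values = np.full(mask.shape, np.nan, dtype=object)
    values[mask] = True
    return pd.DataFrame(values, columns=list(counts["group"]))


counts = pd.DataFrame({"group": ["A", "B", "C", "D"], "count": [2, 0, 4, 1]})
out = fill_randomly(counts, n_rows=5, seed=0)
print(out)
print(out.count())  # should match the "count" column of the input
```

Passing the same `seed` reproduces the same layout, and `seed=None` gives a fresh random layout on each call.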