python pandas random

Randomly fill a fixed-size pandas DataFrame with values according to a count per column (the reverse of .count())


I need a DataFrame with r rows and a dynamic number of columns (one per group). The count column of the input specifies how many True values each column of the new DataFrame should contain. My current implementation creates a temporary single-row DataFrame containing a True value for each group in df, and then explode()s that temporary DataFrame. Finally, it groups by count and aggregates into the result df.

input

--

| group | count | ... 
|   A   |   2   |     
|   B   |   0   |     
|   C   |   4   |     
|   D   |   1   |     

I need to fill a new DataFrame with these values at random positions (the number of columns is dynamic, as are the column names).

expected output

--

A B C D
NaN NaN True True
True NaN True NaN
NaN NaN NaN NaN
NaN NaN True NaN
True NaN True NaN

I think it might be possible to attach a randomized set of row numbers between 1 and r to each group and then, after exploding, aggregate with agg(sum) over those values.

my code

--

import pandas as pd

inputs = [
    {"group": "A", "count": 2},
    {"group": "B", "count": 0},
    {"group": "C", "count": 4},
    {"group": "D", "count": 1},
]
df = pd.DataFrame(inputs)

def expand(count:int, group: str) -> pd.DataFrame:
    """expands DF by counts"""
    count = int(round(count))
    df1 = pd.DataFrame([{group: True}])
    # I'm thinking here i need to add random seed
    df1 = df1.assign(count = [list(range(1, count+1))])\
             .explode('count')\
             .reset_index(drop=True)
    return df1

def creator(df: pd.DataFrame) -> pd.DataFrame:
    """create new DF for every group value(count)"""
    dfs = [expand(r, df['group'].values[0]) for r in list(df['count'].values)]
    df = pd.concat(dfs, ignore_index=True)
    return df
    
# parentheses instead of backslashes, so the comment does not break the chain
(
    df.groupby('group', as_index=False)
      .apply(creator)
      .drop('count', axis=1)
      # and groupby my seed
      .groupby(level=1)
      .agg(sum)
)

To summarize my questions, in case that helps:

  1. Is there a method in pandas that makes this easier or better?
  2. How can I generate random row numbers and assign them in the expand() function?
  3. Is there a way to create a DataFrame of a given size filled with NaN and then place my values into it at random positions (with something like where)?

PS: This is my first time asking a question, so I hope I have provided enough information!


Solution

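  • A pure pandas solution would be to use sample on each column, so that every column is shuffled independently of the others:

    import numpy as np

    R = 5 # <-- rows

    out = pd.DataFrame(
        {g: [True]*c + [np.nan]*(R-c) for g, c in df.to_numpy()}
    ).apply(lambda col: col.sample(frac=1).to_numpy())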
    

    Output (one possible result, since the shuffle is random):

    print(out)
    
          A   B     C     D
    0  True NaN  True   NaN
    1   NaN NaN  True   NaN
    2  True NaN   NaN   NaN
    3   NaN NaN  True  True
    4   NaN NaN  True   NaN
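
Since the goal is the reverse of .count(), a result can be checked by calling count() on it, which should give back the input counts. Below is a minimal self-contained sketch of that check. The helper name fill_randomly, its n_rows and seed parameters, and the use of NumPy's default_rng are assumptions made for illustration and are not part of the answer above.

```python
import numpy as np
import pandas as pd


def fill_randomly(df: pd.DataFrame, n_rows: int, seed=None) -> pd.DataFrame:
    """Hypothetical helper: one column per group with `count` True values at random rows."""
    rng = np.random.default_rng(seed)
    columns = {}
    for group, count in zip(df["group"], df["count"]):
        count = int(count)
        if not 0 <= count <= n_rows:
            raise ValueError(f"count for {group!r} must be between 0 and {n_rows}")
        # `count` True values padded with NaN, stored as objects so both can coexist
        values = np.array([True] * count + [np.nan] * (n_rows - count), dtype=object)
        columns[group] = rng.permutation(values)  # shuffle this column on its own
    return pd.DataFrame(columns, index=range(n_rows))


df = pd.DataFrame({"group": ["A", "B", "C", "D"], "count": [2, 0, 4, 1]})
out = fill_randomly(df, n_rows=5, seed=0)

# count() ignores NaN, so it recovers the requested number of True values per column
print(out.count().tolist())  # [2, 0, 4, 1]
```

Passing a seed makes the placement reproducible. The explicit range check is there because a count larger than the number of rows cannot be satisfied.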
    

    Old answer :

    A simple approach would be to start from an all-NaN DataFrame and then, for each group, set True at randomly chosen row labels in that group's column:

    np.random.seed(0)
    
    R = 5 # <-- rows
    
    out = pd.DataFrame(
        np.nan, index=range(R), columns=list(df["group"]),
        dtype=object,  # lets True sit next to NaN without an implicit upcast
    )
    
    for g, c in df.to_numpy():
        out.loc[np.random.choice(out.index, c, replace=False), g] = True
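
The loop can also be replaced by a single boolean mask combined with DataFrame.where, which is close to what the third question asks about. This is only a sketch under assumptions: the names rng, counts, ranks and mask and the seed value are made up here, and every count is assumed to be at most R.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"group": ["A", "B", "C", "D"], "count": [2, 0, 4, 1]})
R = 5  # number of rows

rng = np.random.default_rng(0)  # seed chosen arbitrarily, for reproducibility
counts = df["count"].to_numpy()

# argsort of random numbers gives an independent random permutation of 0..R-1 per column
ranks = rng.random((R, len(df))).argsort(axis=0)

# in column j exactly counts[j] of the ranks are smaller than counts[j]
mask = ranks < counts

# keep True where the mask is set; where() turns every other cell into NaN
out = pd.DataFrame(mask, columns=df["group"].tolist()).where(mask)

print(out.count().tolist())  # [2, 0, 4, 1]
```

The positions of the True values depend on the seed, but the per-column counts always match the input.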