Tags: python, pandas, random, dynamic, faker

Dynamically create a fake dataset based on the subset of another (real) dataset


I've got a few datasets, and for each one I'd like to create a fake dataset that is roughly representative of it. I need to do this dynamically, based only on the type of each column (numeric or object).

Here's an example:

import pandas as pd
import random

# Create a dictionary with columns as lists
data = {
    'ObjectColumn1': [f'Object1_{i}' for i in range(1, 11)],
    'ObjectColumn2': [f'Object2_{i}' for i in range(1, 11)],
    'ObjectColumn3': [f'Object3_{i}' for i in range(1, 11)],
    'NumericColumn1': [random.randint(1, 100) for _ in range(10)],
    'NumericColumn2': [random.uniform(1.0, 10.0) for _ in range(10)],
    'NumericColumn3': [random.randint(1000, 2000) for _ in range(10)],
    'NumericColumn4': [random.uniform(10.0, 20.0) for _ in range(10)]
}

# Create the DataFrame
df = pd.DataFrame(data)


Let's say the above dataset has m (=3) object columns, n (=4) numeric columns, and x (=10) rows. I'd like to create a fake dataset of N (=10,000) rows, so that:

  1. ObjectColumn1, ObjectColumn2, ..., and ObjectColumn_m in the fake dataset are random selections of the entries in ObjectColumn1, ObjectColumn2, ..., and ObjectColumn_m of data, respectively.
  2. ExtraObjectColumn in the fake dataset is an added fake column whose values are randomly selected from a given list (e.g. list = [ran1, ran2, ran3]).
  3. Each numeric column in the fake dataset holds randomly generated numbers between the minimum and the median of the corresponding column in data. For example, NumericColumn1 in the fake data would hold random values between 3 and 71.5.
  4. I don't want the columns to be hard-coded. Imagine m, n, x and N are all dynamic. I need to use this on many datasets, so the function needs to detect the object and numeric columns and do the above dynamically. The only column that is NOT dynamic is ExtraObjectColumn, which needs to be given a list to be created from.
  5. Performance needs to be reasonable, since N is usually large (at least 10,000).
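
To make points 3 and 4 concrete, here is a minimal sketch of how the column groups and the per-column sampling bounds could be derived from dtypes alone. The frame `toy` and its column names are hypothetical stand-ins, not the real data, and treating every non-numeric column as an "object" column is an assumption.

```python
import pandas as pd

# Hypothetical toy frame standing in for a real dataset
toy = pd.DataFrame({
    "label": ["a", "b", "c", "d"],
    "value": [3, 10, 50, 93],
})

# Detect column groups by dtype instead of hard-coding names
numeric_cols = toy.select_dtypes("number").columns.tolist()
other_cols = [c for c in toy.columns if c not in numeric_cols]

# Sampling bounds per numeric column: row "min" is the lower bound,
# row "median" is the upper bound
bounds = toy[numeric_cols].agg(["min", "median"])
print(bounds)
```

Each numeric column of the fake dataset would then be drawn between its "min" and "median" entries in `bounds`.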

Here's what fake_data should look like if N = 4:

[image: example fake_data with 4 rows]


Solution

  • IIUC, something like this should do what you want. It separates the input dataframe into numeric and other columns, then takes random samples as described in the question from those columns, finally adding a list of extra data as a random sample from the supplied list:

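    import random
    import numpy as np
    import pandas as pd

    def make_fake_data(df, N, extra):
        # Object columns: sample N values (with replacement) from each column
        df_obj = df.select_dtypes('object')
        obj_out = pd.DataFrame({ col : np.random.choice(df_obj[col], N) for col in df_obj.columns })
        # Numeric columns: draw N uniform values between each column's min and median
        df_num = df.select_dtypes('number')
        num_out = pd.DataFrame({ col : np.random.uniform(np.nanmin(df_num[col]), np.nanmedian(df_num[col]), N) for col in df_num.columns })
        # Extra column: sample N values from the supplied list
        ext_out = pd.DataFrame({ 'ExtraObjectColumn' : random.choices(extra, k=N) })
        return pd.concat([obj_out, num_out, ext_out], axis=1)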
    

    Sample usage:

    make_fake_data(df, 20, ['a', 'b', 'c', 'd'])
    

    Sample output:

       ObjectColumn1 ObjectColumn2 ObjectColumn3  ...  NumericColumn3  NumericColumn4  ExtraObjectColumn
    0      Object1_4     Object2_1     Object3_4  ...     1322.269370       14.502498                  d
    1      Object1_6     Object2_5     Object3_5  ...     1314.941227       12.478253                  c
    2      Object1_6     Object2_7     Object3_7  ...     1418.271732       11.214247                  a
    3      Object1_4     Object2_9     Object3_9  ...     1269.408303       11.404303                  c
    4      Object1_3     Object2_6     Object3_4  ...     1426.038132       14.251836                  a
    5      Object1_1     Object2_2     Object3_1  ...     1212.806903       14.750310                  c
    6     Object1_10     Object2_7     Object3_1  ...     1294.254746       10.692256                  d
    7      Object1_1     Object2_7     Object3_3  ...     1232.854020       10.438323                  c
    8      Object1_5     Object2_5     Object3_7  ...     1205.779688       14.763409                  c
    9      Object1_7     Object2_6     Object3_2  ...     1287.248660       10.384493                  b
    10     Object1_4     Object2_2     Object3_1  ...     1237.738855       14.054841                  b
    11     Object1_7     Object2_3     Object3_5  ...     1176.494651       12.869827                  c
    12     Object1_5     Object2_1    Object3_10  ...     1101.036149       10.978762                  b
    13     Object1_5     Object2_6     Object3_7  ...     1430.060873       13.473017                  c
    14     Object1_1     Object2_1     Object3_7  ...     1416.556459       12.281628                  c
    15     Object1_3     Object2_8     Object3_3  ...     1190.239080       15.257389                  b
    16     Object1_6     Object2_9     Object3_5  ...     1101.712808       10.551654                  b
    17     Object1_1    Object2_10     Object3_4  ...     1453.687960       15.070104                  b
    18     Object1_6     Object2_2     Object3_2  ...     1139.413534       11.744450                  b
    19     Object1_7     Object2_7     Object3_2  ...     1080.682206       13.962322                  b
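
If reproducible output is useful (for tests, say), the same idea can be sketched with a seeded NumPy `Generator`. This is a variant under stated assumptions rather than the function above: the name `make_fake_data_seeded` and its `seed` parameter are hypothetical, and it treats every non-numeric column as a column to sample from instead of selecting only `object` dtype.

```python
import numpy as np
import pandas as pd


def make_fake_data_seeded(df, n_rows, extra, seed=None):
    """Hypothetical seeded variant: same sampling rules, reproducible output."""
    rng = np.random.default_rng(seed)
    num = df.select_dtypes("number")
    other = df.drop(columns=num.columns)  # assumption: all non-numeric columns

    out = {}
    # Non-numeric columns: sample with replacement from the observed values
    for col in other.columns:
        out[col] = rng.choice(other[col].to_numpy(), size=n_rows)
    # Numeric columns: uniform draws between the column minimum and median
    # (Series.min and Series.median skip NaN by default)
    for col in num.columns:
        out[col] = rng.uniform(num[col].min(), num[col].median(), size=n_rows)
    # Extra column: sample with replacement from the supplied list
    out["ExtraObjectColumn"] = rng.choice(np.asarray(extra, dtype=object), size=n_rows)
    return pd.DataFrame(out)


# Small hypothetical input to exercise the function
real = pd.DataFrame({"name": ["x", "y", "z"], "score": [1.0, 5.0, 9.0]})
a = make_fake_data_seeded(real, 5, ["p", "q"], seed=0)
b = make_fake_data_seeded(real, 5, ["p", "q"], seed=0)

# Same seed gives the same frame; values respect the question's constraints
print(a.equals(b))
print(a["score"].between(real["score"].min(), real["score"].median()).all())
print(set(a["name"]) <= set(real["name"]))
```

All sampling here is done in whole-array calls, so the cost grows with the number of columns rather than with a Python-level loop over the N rows.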