python, pandas, python-multiprocessing, python-multithreading, faker

Parallelize dummy data generation in pandas


I would like to generate a dummy dataset composed of a fake first name and a last name for 40 million records, using n processor cores.

Below is a single-task loop that generates a first name and a last name and appends them to a list:

import pandas as pd
from faker import Faker

def fake_data_generation(records):
    # one Faker instance per call, drawing from the US and GB locale name pools
    fake = Faker(['en_US','en_GB'])
    
    person = []
    
    for i in range(records):
        first_name = fake.first_name()
        last_name = fake.last_name()
        person.append({"First_Name": first_name,
                       "Last_Name": last_name}
                     )
    return person

Output (df holds the result of the final iteration, fake_data_generation(4)):

for i in range(5):
    df = pd.DataFrame(fake_data_generation(i))

>>> df
  First_Name Last_Name
0      Colin   Stewart
1    Barbara      Rios
2     Victor     Green
3  Stephanie     Booth
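
For reference, one way to split the loop above across worker processes is with multiprocessing.Pool, since each worker can build its own Faker instance. A minimal sketch, assuming an evenly divided workload; parallel_fake_data, the 4-process pool size, and the record counts are illustrative choices, not part of the original code:

import pandas as pd
from multiprocessing import Pool

# reuses fake_data_generation() defined above; each worker process creates
# its own Faker instance, so nothing has to be shared between processes
def parallel_fake_data(total_records, workers=4):
    # split the record count evenly; the last worker absorbs the remainder
    chunk = total_records // workers
    counts = [chunk] * workers
    counts[-1] += total_records - chunk * workers

    with Pool(workers) as pool:
        parts = pool.map(fake_data_generation, counts)

    # flatten the per-worker lists of dicts into a single DataFrame
    return pd.DataFrame([row for part in parts for row in part])

if __name__ == '__main__':
    df = parallel_fake_data(1_000_000, workers=4)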

Solution

  • Maybe you can use the name providers directly and let NumPy do the sampling:

    import pandas as pd
    import numpy as np
    from faker.providers.person.en_US import Provider as us
    from faker.providers.person.en_GB import Provider as gb
    
    # de-duplicated pools of first and last names pulled straight from the US and GB providers
    first_names = list(set(us.first_names).union(gb.first_names))
    last_names = list(set(us.last_names).union(gb.last_names))
    
    # sample 40 million names per column in a single vectorized draw each
    N = 40_000_000
    df = pd.DataFrame({'First_Name': np.random.choice(first_names, N),
                       'Last_Name': np.random.choice(last_names, N)})
    

    Output:

    >>> df
             First_Name Last_Name
    0             Kayla      Tran
    1              Gary     Bates
    2             Daisy   Leblanc
    3           Tiffany     Ahmed
    4            Kellie       May
    ...             ...       ...
    39999995   Kristine   Collier
    39999996      Joyce     Mccoy
    39999997       Paul   Padilla
    39999998      Tonya     Bevan
    39999999      Julie    Bright
    
    [40000000 rows x 2 columns]
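
    This is fast because it replaces 40 million individual Faker calls with two vectorized np.random.choice draws over fixed name pools, at the cost of names repeating since the pools are finite. If reproducible output matters, a seeded NumPy generator can be swapped in; a small sketch, where the seed value 42 is arbitrary:

    rng = np.random.default_rng(42)   # seeded generator for repeatable samples
    df = pd.DataFrame({'First_Name': rng.choice(first_names, N),
                       'Last_Name': rng.choice(last_names, N)})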