python, numpy, montecarlo

How to create nested weighted distributions in Python/numpy?


I am trying to optimize (vectorize?) the creation of a Monte Carlo-style simulation, and I have not been able to figure out how to create nested weighted random values using numpy or similar libraries. Consider this scenario, inspired by an interviewqs question: "Students in three classrooms have to vote for one of two class president candidates. Classroom A has 40% of the students and is split 50/50 on candidates X and Y. B has 25% of the students and is split 60/40. C has 35% of the students and is split 35/65."
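For reference, the analytic answer any correct simulation should converge to is easy to compute from those percentages, which makes a handy sanity check on the generated data (a quick back-of-the-envelope sketch, variable names just for illustration):

weights = [0.40, 0.25, 0.35]    # P(classroom A, B, C)
support_x = [0.50, 0.60, 0.35]  # P(votes X | classroom)
p_x = sum(w * s for w, s in zip(weights, support_x))
print(round(p_x, 4))  # 0.4*0.5 + 0.25*0.6 + 0.35*0.35 = 0.4725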

Creating that data with vanilla Python might look something like this:

import random

nsimulations = 10_000_000

def choose_classroom():
    "returns A, B, or C based on percentages"
    value = random.random()
    if value < .4:       # A: 40%
        return 'A'
    elif value < .65:    # B: next 25% (cumulative .4 + .25)
        return 'B'
    else:                # C: remaining 35%
        return 'C'
        
def choose_support(classroom):
    "return X or Y based on support percentage by classroom"
    value = random.random()
    if classroom == 'A':
        return "X" if value < .5 else "Y"
    elif classroom == 'B':
        return "X" if value < .6 else "Y"
    elif classroom == 'C':
        return "X" if value < .35 else "Y"
        
results = []
for i in range(nsimulations):
    classroom = choose_classroom()
    support = choose_support(classroom)
    results.append({'classroom': classroom, 'support': support})

It takes about 10 seconds to run the 10M simulations on my machine, and I'd like to improve that time. The first thing I looked at was numpy.random.choice: fast_classrooms = np.random.choice(['A', 'B', 'C'], size=nsimulations, p=[.4, .25, .35]). That executes quickly, in about 350ms. However, I don't know how to generate the follow-on X/Y distributions from there.
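That measurement was just a simple wall-clock timing, something like the following (times will obviously vary by machine):

import time
import numpy as np

nsimulations = 10_000_000

start = time.perf_counter()
fast_classrooms = np.random.choice(['A', 'B', 'C'], size=nsimulations, p=[.4, .25, .35])
print(f"{time.perf_counter() - start:.2f}s")  # ~0.35s for me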

One thing I've tried is Pandas apply, and it seems to be doing some kind of optimization under the covers. The Pandas snippet below takes ~2.5 seconds to run, while a list comprehension ([choose_support(record) for record in fast_classrooms]) takes ~4 seconds. Still, it feels like this isn't the "right" way to do things.

import pandas
import numpy as np

fast_classrooms = np.random.choice(['A', 'B', 'C'], size=nsimulations, p=[.4, .25, .35])
df = pandas.DataFrame({'classroom': fast_classrooms})
df['support'] = df.classroom.apply(choose_support)
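Whichever version generates the data, the simulated marginal can be checked against the analytic 0.4725 computed above, e.g. with pandas value_counts:

print(df['support'].value_counts(normalize=True))
# expect roughly X ≈ 0.4725, Y ≈ 0.5275 at 10M samples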

What I had hoped to find is something that could generate nested weighted distributions in one call, something like this hypothetical API: np.random.choice([['A', 'B', 'C'], ['X', 'Y']], p=[[.4, .25, .35], [[.5, .5], [.6, .4], [.35, .65]]])

What are some ways of going about generating this data?


Solution

  • You can significantly reduce the amount of Python-level looping, making the code more vectorized:

    import numpy as np
    
    nsimulations = 12
    uniquerooms = ['A', 'B', 'C']
    supportprobs = [0.5, 0.6, 0.35]   # P(support == 'X') for A, B, C
    classrooms = np.random.choice(uniquerooms, size=nsimulations, p=[.4, .25, .35])
    supports = np.empty_like(classrooms)
    # One vectorized draw per classroom instead of one per student:
    for room, prob in zip(uniquerooms, supportprobs):
        mask = classrooms == room
        supports[mask] = np.random.choice(['X', 'Y'], size=mask.sum(), p=[prob, 1 - prob])
    
    print(classrooms)
    # ['C' 'A' 'C' 'A' 'C' 'C' 'A' 'C' 'B' 'C' 'B' 'A']
    print(supports)
    # ['X' 'Y' 'Y' 'Y' 'Y' 'X' 'Y' 'X' 'X' 'X' 'Y' 'X']
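
  • Going a step further, you can drop the Python loop entirely: draw classroom indices instead of labels, use them to look up each student's classroom-specific P(X), and settle support with a single uniform draw per student. A minimal sketch of that idea, using numpy's Generator API:

    import numpy as np
    
    nsimulations = 10_000_000
    rng = np.random.default_rng()
    
    # One draw picks each student's classroom index (0=A, 1=B, 2=C) ...
    room_idx = rng.choice(3, size=nsimulations, p=[0.4, 0.25, 0.35])
    # ... which indexes that classroom's P(support == 'X') ...
    prob_x = np.array([0.5, 0.6, 0.35])[room_idx]
    # ... and one uniform draw per student settles X vs Y.
    classrooms = np.array(['A', 'B', 'C'])[room_idx]
    supports = np.where(rng.random(nsimulations) < prob_x, 'X', 'Y')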