I am trying to optimize (vectorize?) a Monte Carlo-style simulation, and I have not been able to figure out how to generate nested weighted random values using NumPy
or similar libraries. Consider this scenario, inspired by an interviewqs
question: "Students in three classrooms have to vote for one of two class president candidates. Classroom A has 40% of the students and is split 50/50 on candidates X and Y. B has 25% of the students and is split 60/40. C has 35% of the students and is split 35/65."
Creating that data with vanilla Python might look something like this:
import random

nsimulations = 10_000_000

def choose_classroom():
    "returns A, B, or C based on percentages"
    value = random.random()
    if value < .4:
        return 'A'
    elif value < .65:
        return 'B'
    else:
        return 'C'

def choose_support(classroom):
    "return X or Y based on support percentage by classroom"
    value = random.random()
    if classroom == 'A':
        return "X" if value < .5 else "Y"
    elif classroom == 'B':
        return "X" if value < .6 else "Y"
    elif classroom == 'C':
        return "X" if value < .35 else "Y"

results = []
for i in range(nsimulations):
    classroom = choose_classroom()
    support = choose_support(classroom)
    results.append({'classroom': classroom, 'support': support})
It takes about 10 seconds to run the 10M simulations on my machine, and I'd like to improve that time. The first thing I looked at was numpy.random.choice:

fast_classrooms = np.random.choice(['A', 'B', 'C'], size=nsimulations, p=[.4, .25, .35])

That executes quickly, in about 350ms. However, I don't know how to generate the follow-on X/Y distributions from there.
One thing I've tried is Pandas apply, and it seems to be doing some kind of optimization under the covers. The Pandas snippet below takes ~2.5 seconds to run, while a list comprehension ([choose_support(record) for record in fast_classrooms]) takes ~4 seconds. Still, it feels like this isn't the "right" way to do things.
import pandas
import numpy as np
fast_classrooms = np.random.choice(['A', 'B', 'C'], size=nsimulations, p=[.4, .25, .35])
df = pandas.DataFrame({'classroom': fast_classrooms})
df['support'] = df.classroom.apply(choose_support)
What I had hoped to find is something that could generate nested weighted distributions, something like np.random.choice([['A', 'B', 'C'], ['X', 'Y']], p=[[.4, .25, .35], [[.5, .5], [.6, .4], [.35, .65]]]).
What are some ways of going about generating this data?
You can significantly reduce the amount of Python-level looping and make the code more vectorized:
import numpy as np

nsimulations = 12

uniquerooms = ['A','B','C']
supportprobs = [0.5, 0.6, 0.35]

# draw every classroom assignment in one vectorized call
classrooms = np.random.choice(uniquerooms, size=nsimulations, p=[.4, .25, .35])
supports = np.empty_like(classrooms, dtype=classrooms.dtype)

# one pass per classroom: fill in the supports for all of its simulations at once
for room, prob in zip(uniquerooms, supportprobs):
    mask = classrooms == room
    supports[mask] = np.random.choice(['X','Y'], size=mask.sum(), p=[prob, 1-prob])
print(classrooms)
# ['C' 'A' 'C' 'A' 'C' 'C' 'A' 'C' 'B' 'C' 'B' 'A']
print(supports)
# ['X' 'Y' 'Y' 'Y' 'Y' 'X' 'Y' 'X' 'X' 'X' 'Y' 'X']
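If you want the nested weighting expressed as a single draw, closer to the np.random.choice call you were hoping for, another option is to flatten the two levels into one joint distribution over (classroom, support) pairs and sample indices into it. This is only a sketch using the probabilities from the question; the pairs/joint_p names are just illustrative variables, not a NumPy feature.

import numpy as np

nsimulations = 10_000_000

# every (classroom, support) outcome and its joint probability,
# e.g. P(A and X) = P(A) * P(X | A) = 0.4 * 0.5
pairs = np.array([['A', 'X'], ['A', 'Y'],
                  ['B', 'X'], ['B', 'Y'],
                  ['C', 'X'], ['C', 'Y']])
joint_p = [0.4 * 0.5,   0.4 * 0.5,
           0.25 * 0.6,  0.25 * 0.4,
           0.35 * 0.35, 0.35 * 0.65]

# sample row indices into the flattened outcome table in one vectorized call,
# then split the sampled pairs back into the two columns
idx = np.random.choice(len(pairs), size=nsimulations, p=joint_p)
classrooms = pairs[idx, 0]
supports = pairs[idx, 1]

As a sanity check, (supports == 'X').mean() should converge to 0.4*0.5 + 0.25*0.6 + 0.35*0.35 = 0.4725.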