I've been working to create a Monte Carlo simulation that will run through each ID of my dataframe and yield their respective Means and Standard Deviations. I've been able to write the code to get it for any one ID, but not to iterate through the entire list of IDs in my dataframe. So I could write each line individually, but I need the code to iterate through any mutable list of IDs.
Here, I've tried to create a list of lists in which each set of Monte Carlo observations can be stored (and the mean and std could be taken from). I don't believe that will be the most efficient way of coding this, but it's what I know at this point. Is there anyway to run the Monte Carlo simulation on each of the IDs (without specifically calling each)? I need to be able to add and remove various IDs and corresponding data from the list.
This is a follow up on: Utilizing Monte Carlo to Predict Revenue in Python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
ID = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
Revenue = [1000, 1200, 1300, 100 ,500, 0, 800, 950, 4321, 800, 1000, 1200, 1300, 100 ,500, 0, 800, 950, 4321, 800]
odds = [0.5, 0.6, 0.33, 0.1, 0.9, 0.87, 0.37, 0.55, 0.97, 0.09, 0.5, 0.6, 0.33, 0.1, 0.9, 0.87, 0.37, 0.55, 0.97, 0.09]
d = {'ID': ID, 'Revenue': Revenue, 'Odds': odds}
df = pd.DataFrame(d)
df['Expected Value'] = df['Revenue']*df['Odds']
print(df)
num_samples = 100
df['Random Number'] = np.random.rand(len(df))
def monte_carlo_array(df):
for _ in range(len(df)):
yield []
mc_arrays = list(monte_carlo_array(df))
# Fill each list with 100 observations (no filtering necessary)
id_1 = []
filter_1 = (df['ID'] == 5)
for _ in range(num_samples):
sample = df['Revenue'] * np.where(np.random.rand(len(df)) < \
df['Odds'], 1, 0)
for l in monte_carlo_array(df):
for i in l:
mc_arrays[i].append(sample.sum())
id_1.append(sample.loc[filter_1].sum())
# Plot simulation results.
n_bins = 10
plt.hist([id_1], bins=n_bins, label=["ID: 1"])
plt.legend()
plt.title("{} simulations of revenue".format(num_samples))
print(mc_arrays)
df['Monte Carlo Mean'] = np.mean(mc_arrays[0])
print(df['Monte Carlo Mean'])
IIUC, this is what you're going for:
ID
), you want a total of num_samples
Monte Carlo simulations of whether that row achieves its Revenue
. Revenue
is by comparing a randomly drawn value in [0,1]
against the Odds
for that row (in standard Monte Carlo fashion). Revenue
for each row, across all samples. If so, you can do this by leveraging the sampling function of the Binomial distribution, instead of drawing from a Uniform and then filtering based on Odds
. I'll post a solution using this approach at the end of this answer.
But since you've started out with the Uniform-draw approach: I'd recommend first making a sampling matrix s_draws
of n_rows = len(df)
by num_samples
(aka n_draws
in my code below). Then apply the Odds
check against all of the samples in each row of s_draws
. Then multiply by Revenue
, and take the mean and sd for each row. Like this:
First, draw the sampling matrix:
np.random.seed(42)
n_rows = len(df)
n_draws = 5
s_draws = pd.DataFrame(np.random.rand(n_rows, n_draws))
# the matrix of random values between [0,1]
# note: only showing the first 3 rows for brevity
s_draws
0 1 2 3 4
0 0.374540 0.950714 0.731994 0.598658 0.156019
1 0.155995 0.058084 0.866176 0.601115 0.708073
2 0.020584 0.969910 0.832443 0.212339 0.181825
...
Now figure out which sampled instances "achieved" the target Revenue
:
s_rev = s_draws.apply(lambda col: col.lt(df.Odds) * df.Revenue)
# the matrix of sampled revenue
s_rev
0 1 2 3 4
0 1000 0 0 0 1000
1 1200 1200 0 0 0
2 1300 0 0 1300 1300
...
Finally, compute summary statistics for each row/ID
:
s_result = pd.DataFrame({"avg": s_rev.mean(axis=1), "sd": s_rev.std(axis=1)})
# the summary statistics of each row of samples
s_result
avg sd
0 400.0 547.722558
1 480.0 657.267069
2 780.0 712.039325
...
And here's the version using Binomial sampling:
draws = pd.DataFrame(
np.random.binomial(n=1, p=df.Odds, size=(n_draws, n_rows)).T
).multiply(df.Revenue, axis=0)
pd.DataFrame({"avg": draws.mean(axis=1), "sd": draws.std(axis=1)})
Note: This all would work a little differently if ID
was repeated across multiple rows in df
. In that case, you could use groupby
and then take the summary statistics. But in your example, ID
is never repeated, so I'll leave the answer as-is for now.