Search code examples
pythonpython-3.xpandasmontecarlo

Monte Carlo Simulation for Multiple Entries in Python


I've been working to create a Monte Carlo simulation that will run through each ID of my dataframe and yield their respective Means and Standard Deviations. I've been able to write the code to get it for any one ID, but not to iterate through the entire list of IDs in my dataframe. So I could write each line individually, but I need the code to iterate through any mutable list of IDs.

Here, I've tried to create a list of lists in which each set of Monte Carlo observations can be stored (and the mean and std could be taken from). I don't believe that will be the most efficient way of coding this, but it's what I know at this point. Is there anyway to run the Monte Carlo simulation on each of the IDs (without specifically calling each)? I need to be able to add and remove various IDs and corresponding data from the list.

This is a follow up on: Utilizing Monte Carlo to Predict Revenue in Python

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np


ID = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
Revenue = [1000, 1200, 1300, 100 ,500, 0, 800, 950, 4321, 800, 1000, 1200, 1300, 100 ,500, 0, 800, 950, 4321, 800]
odds = [0.5, 0.6, 0.33, 0.1, 0.9, 0.87, 0.37, 0.55, 0.97, 0.09, 0.5, 0.6, 0.33, 0.1, 0.9, 0.87, 0.37, 0.55, 0.97, 0.09]

d = {'ID': ID, 'Revenue': Revenue, 'Odds': odds}
df = pd.DataFrame(d)
df['Expected Value'] = df['Revenue']*df['Odds']

print(df)

num_samples = 100
df['Random Number'] = np.random.rand(len(df))

def monte_carlo_array(df):
    for _ in range(len(df)):
        yield []

mc_arrays = list(monte_carlo_array(df))

# Fill each list with 100 observations (no filtering necessary)
id_1 = []
filter_1 = (df['ID'] == 5)

for _ in range(num_samples):
    sample = df['Revenue'] * np.where(np.random.rand(len(df)) < \
                          df['Odds'], 1, 0)
    for l in monte_carlo_array(df):
        for i in l:
        mc_arrays[i].append(sample.sum())
    id_1.append(sample.loc[filter_1].sum())


# Plot simulation results.
n_bins = 10
plt.hist([id_1], bins=n_bins, label=["ID: 1"])
plt.legend()
plt.title("{} simulations of revenue".format(num_samples))

print(mc_arrays)

df['Monte Carlo Mean'] = np.mean(mc_arrays[0])
print(df['Monte Carlo Mean'])

Solution

  • IIUC, this is what you're going for:

    • For each row (which represents an ID), you want a total of num_samples Monte Carlo simulations of whether that row achieves its Revenue.
    • The way you determine whether a given simulated instance achieves its Revenue is by comparing a randomly drawn value in [0,1] against the Odds for that row (in standard Monte Carlo fashion).
    • You want to know the mean and standard deviation of the Revenue for each row, across all samples.

    If so, you can do this by leveraging the sampling function of the Binomial distribution, instead of drawing from a Uniform and then filtering based on Odds. I'll post a solution using this approach at the end of this answer.

    But since you've started out with the Uniform-draw approach: I'd recommend first making a sampling matrix s_draws of n_rows = len(df) by num_samples (aka n_draws in my code below). Then apply the Odds check against all of the samples in each row of s_draws. Then multiply by Revenue, and take the mean and sd for each row. Like this:

    First, draw the sampling matrix:

    np.random.seed(42)
    
    n_rows = len(df)
    n_draws = 5
    s_draws = pd.DataFrame(np.random.rand(n_rows, n_draws))
    
    # the matrix of random values between [0,1]
    # note: only showing the first 3 rows for brevity
    s_draws
               0         1         2         3         4
    0   0.374540  0.950714  0.731994  0.598658  0.156019
    1   0.155995  0.058084  0.866176  0.601115  0.708073
    2   0.020584  0.969910  0.832443  0.212339  0.181825
    ...
    

    Now figure out which sampled instances "achieved" the target Revenue:

    s_rev = s_draws.apply(lambda col: col.lt(df.Odds) * df.Revenue)
    
    # the matrix of sampled revenue
    s_rev
           0     1     2     3     4
    0   1000     0     0     0  1000
    1   1200  1200     0     0     0
    2   1300     0     0  1300  1300
    ...
    

    Finally, compute summary statistics for each row/ID:

    s_result = pd.DataFrame({"avg": s_rev.mean(axis=1), "sd": s_rev.std(axis=1)})
    
    # the summary statistics of each row of samples
    s_result
           avg          sd
    0    400.0  547.722558
    1    480.0  657.267069
    2    780.0  712.039325
    ...
    

    And here's the version using Binomial sampling:

    draws = pd.DataFrame(
        np.random.binomial(n=1, p=df.Odds, size=(n_draws, n_rows)).T
    ).multiply(df.Revenue, axis=0)
    
    pd.DataFrame({"avg": draws.mean(axis=1), "sd": draws.std(axis=1)})
    

    Note: This all would work a little differently if ID was repeated across multiple rows in df. In that case, you could use groupby and then take the summary statistics. But in your example, ID is never repeated, so I'll leave the answer as-is for now.