python pandas dataframe optimization simulation

Optimise simulation

I have a pandas-based simulation that throws a six-sided die 12 times (which is one trial) and then stores the probability of success.

My code that calculates the probability of success for any given n trials is the function Simulation(n):

import numpy as np
import pandas as pd

def Simulation(n):
    """Python function to simulate n rolls of 6-sided die and estimate probability of all faces appearing at least once"""
    if n <= 0: print("Enter a positive integer for number of simulations!")

    # Create a dataframe of die-throw results for n simulations of 12 throws each
    rand_int = np.random.randint(1, 7, size=(n, 12))
    randint_df = pd.DataFrame(rand_int)

    # Add column of success (True) and failure (False) to dataframe

    # All possible values of trial
    list = [1, 2, 3, 4, 5, 6]

    # Count # of trial outcomes=True/Success
    success = randint_df.apply(lambda row: len(np.setdiff1d(list, row)) == 0, axis=1)
    randint_df['Success'] = success

    # Determine probability of Success
    number_success = randint_df.Success.sum()
    probability = number_success / (n * 12)

    return probability

Next, I call Simulation(n) for varying values of n, repeat that for each series in range(0,100,5), and store the results in a dataframe:

"""Python code to graph probability of success for 6-sided die roll 12-times"""
import numpy as np
import pandas as pd
import sim_func
import matplotlib.pyplot as plt

#Calculate probability for series simulations & series

#Create dataframe for results
result = pd.DataFrame(columns=['Series','Simulation','Probability'])

for series in range(0,100,5):
    for n in range(100,100000,10):
        prob=sim_func.Simulation(n)
        #Create new dataframe of series number & probability of success
        df_new_row=pd.DataFrame({'Series':series,'Simulation':n,'Probability':prob},index=[0])
        result=pd.concat([result,df_new_row],axis=0,ignore_index=True)

result.index.name = "Index"

#Plot changing probability for each series vs number of simulations
result.pivot_table(index='Simulation',columns='Series',values='Probability').plot()

# set y label
plt.ylabel('Probability of Success')
# set x label
plt.xlabel('Number Simulations (n)')
# set title
plt.title('Probability vs Simulations (n) for changing number of Series')

plt.show()

The simulation is taking forever to run and I'm not entirely certain why. I feel like adding the probability result for each n one row at a time is inefficient and is slowing things down, but I don't have any ideas on how to optimize it.

Any insight would be greatly appreciated please.

I used a pandas dataframe to run the simulation and store the results for varying numbers of trials. The results are then graphed.

Solution

Here's a modest re-work that I think will help you. A few things to point out regarding code and speed:

Don't use built-in names as variable names. You were using list as the name of a list, which shadows the built-in list type and will cause you a ton of headaches.
You are being a bit inconsistent with your definition of a "trial." I took your intention to be that one trial is 12 rolls of a 6-sided die, and that a success means all 6 faces show up within those 12 rolls. Realize that the probability of success is a fixed value, and doing more trials only improves the estimate of that fixed value. Note: I'm getting different results from yours because you were dividing the number of successes by n * 12 rather than by n.
Regarding speed: you can simplify your simulation function and just look at each trial on the fly. This is fairly fast, and you avoid allocating a large block of memory if you try a super-huge number of trials. Avoid making a dataframe here; it is costly to build if you are doing a large number of trials. I suppose you could also do some direct math on a 12 x N numpy array pretty efficiently without a dataframe. In your main code, you really want to avoid concatenating rows to a dataframe one at a time. It is very slow, because every concat builds a new dataframe and copies the existing rows. Building the dataframe "all at once" from a list of records, as I show, is the way to go.
Exploring a large space: I recommend a power series such as [1, 10, 100, 1000, 10000, ...], or your results get massive. Do you really need to count to 100,000 by tens? I used powers of 2, but there are many other options (fractional powers, etc.) that could be used.

If your notion of a "trial" is something else, you should be able to modify this code pretty easily to accommodate.
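
Since the probability is a fixed value, you can also compute it exactly and use it as a sanity check on whatever the simulation converges to. Here is a minimal sketch using inclusion-exclusion over the number of faces that never show up. The function name exact_all_faces is my own, and it assumes a fair die and my reading of "success" (all faces appear at least once):

```python
from math import comb


def exact_all_faces(rolls=12, faces=6):
    """Exact probability that every face of a fair die appears in `rolls` rolls."""
    # Inclusion-exclusion: choose k faces to exclude, then all rolls
    # must land on the remaining (faces - k) faces.
    return sum(
        (-1) ** k * comb(faces, k) * ((faces - k) / faces) ** rolls
        for k in range(faces + 1)
    )


if __name__ == "__main__":
    print(f"P(all 6 faces in 12 rolls) = {exact_all_faces():.4f}")
```

For 12 rolls this comes out at roughly 0.44, so the simulated curves should settle around that value as the number of trials grows.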

Code

"""Python code to graph probability of success for 6-sided die roll 12-times"""
import numpy as np
import pandas as pd
# import sim_func
import matplotlib.pyplot as plt

# estimate the probability of rolling all 6 faces of a 6-sided die within `rolls` rolls
def sim2(trials, rolls=12):
    successes = 0
    for t in range(trials):
        # we have all 6 values if set size is 6
        if len(set(np.random.randint(1, 7, rolls))) == 6:  
            successes += 1
    return successes/trials


# hold the results in a list
result = []  

for series in 'ABCDE':
    # a power series is good for exploring a large range of values...
    for n in [2**p for p in range(12)]:
        prob = sim2(n)  # sim2 is defined above; the sim_func import is commented out
        # create a record of the result
        res = {'Series': series, 'Trials': n, 'P': prob}
        result.append(res)

# assemble the dataframe "all at once" rather than a bazillion concats
result = pd.DataFrame.from_records(result)

#Plot changing probability for each series vs number of simulations
result.pivot_table(index='Trials',columns='Series',values='P').plot()

# set y label
plt.ylabel('Probability of Success')
# set x label
plt.xlabel('Number Simulations (n)')
# set title
plt.title('Probability vs Simulations (n) for changing number of Series')

plt.show()
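
As for the "direct math on a numpy array" idea mentioned above, here is a rough sketch of what that could look like. I have not benchmarked it against sim2; the function name sim_vectorized and the optional rng parameter are my own additions. It draws all trials at once as a (trials, rolls) array, so it trades memory for speed:

```python
import numpy as np


def sim_vectorized(trials, rolls=12, faces=6, rng=None):
    """Estimate P(all faces appear within `rolls` rolls) from `trials` trials."""
    if trials <= 0:
        raise ValueError("trials must be a positive integer")
    rng = np.random.default_rng() if rng is None else rng

    # One row per trial, one column per roll; values are 1..faces inclusive.
    throws = rng.integers(1, faces + 1, size=(trials, rolls))

    # For each face, flag the trials (rows) in which it shows up at least once,
    # then keep only the trials where every face was flagged.
    seen = [(throws == face).any(axis=1) for face in range(1, faces + 1)]
    success = np.all(seen, axis=0)

    return float(success.mean())


if __name__ == "__main__":
    print(sim_vectorized(100_000))
```

You can drop this in place of sim2 in the loop above. For a really huge number of trials you would want to process the array in chunks so it still fits in memory.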
