python-3.x | pandas | dataframe | simulation | coin-flipping

Simulating 10,000 Coin Flips in Python Is Very Slow


I am writing a simulation that creates 10,000 periods of 25 sets, with each set consisting of 48 coin tosses. Something in this code is making it run very slowly: it has been running for at least 20 minutes and is still not finished. A similar simulation in R completes in under 10 seconds.

Here is the Python code I am using:

import pandas as pd
from random import choices

threshold=17
all_periods = pd.DataFrame()

for i in range(10000):
    simulated_period = pd.DataFrame()
    for j in range(25):
        #Data frame with 48 weeks as rows. Each run through loop adds one more year as column until there are 25
        simulated_period = pd.concat([simulated_period, pd.DataFrame(choices([1, -1], k=48))],
                                     ignore_index=True, axis=1)
        positives = simulated_period[simulated_period==1].count(axis=1)
        negatives = simulated_period[simulated_period==-1].count(axis=1)
        #Combine positives and negatives that are more than the threshold into single dataframe
        sig = pd.DataFrame([[sum(positives>=threshold), sum(negatives>=threshold)]], columns=['positive', 'negative'])
        sig['total'] = sig['positive'] + sig['negative']
    #Add summary of individual simulation to the others
    all_periods = pd.concat([all_periods, sig])

If it helps, here is the R script that is running quickly:

flip <- function(threshold=17){
  #threshold is min number of persistent results we want to see. For example, 17/25 positive or 17/25 negative

  outcomes <- c(1, -1)
  trial <- do.call(cbind, lapply(1:25, function (i) sample(outcomes, 48, replace=T)))
  trial <- as.data.frame(t(trial)) #48 weeks in columns, 25 years in rows.

  summary <- sapply(trial, function(x) c(pos=length(x[x==1]), neg=length(x[x==-1])))
  summary <- as.data.frame(t(summary)) #use data frame so $pos/$neg can be used instead of [1,]/[2,]

  sig.pos <- length(summary$pos[summary$pos>=threshold])
  sig.neg <- length(summary$neg[summary$neg>=threshold])

  significant <- c(pos=sig.pos, neg=sig.neg, total=sig.pos+sig.neg) 

  return(significant)
}

results <- do.call(rbind, lapply(1:10000, function(i) flip()))
results <- as.data.frame(results)

Can anyone tell me what in my Python code is slowing the process down? Thank you.


Solution

  • Why don't you generate the whole big set at once?

    import numpy as np
    import pandas as pd

    idx = pd.MultiIndex.from_product((range(10000), range(25)),
                                     names=('period', 'set'))
    df = pd.DataFrame(data=np.random.choice([1, -1], (10000*25, 48)), index=idx)
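
    As a quick sanity check on the frame built above (same df and idx as defined there):

    print(df.shape)        # (250000, 48): 10000 periods x 25 sets, 48 tosses each
    print(df.index.names)  # ['period', 'set']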
    

    Building that frame took about 120 ms on my computer. Then the other operations:

    # per period, count how many of the 48 weeks have >= 17 positives (or negatives) across the 25 sets
    positives = df.eq(1).groupby(level=0).sum().ge(17).sum(axis=1).to_frame(name='positives')
    negatives = df.eq(-1).groupby(level=0).sum().ge(17).sum(axis=1).to_frame(name='negatives')

    all_periods = pd.concat((positives, negatives), axis=1)

    all_periods['total'] = all_periods.sum(axis=1)
    

    take about 600 ms extra.
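
    If you do not need the intermediate DataFrame at all, the same counts can be computed with NumPy alone, which avoids the pandas overhead entirely. This is a minimal sketch, assuming the same 10,000 x 25 x 48 setup and a threshold of 17; the variable names are illustrative, not from the code above:

    import numpy as np

    threshold = 17

    # periods x sets x tosses: 10,000 periods of 25 sets, 48 draws of +1/-1 each
    tosses = np.random.choice([1, -1], size=(10000, 25, 48))

    # per period and week, count how many of the 25 sets came up positive/negative
    pos_counts = (tosses == 1).sum(axis=1)    # shape (10000, 48)
    neg_counts = (tosses == -1).sum(axis=1)

    # per period, the number of weeks meeting the threshold
    pos = (pos_counts >= threshold).sum(axis=1)
    neg = (neg_counts >= threshold).sum(axis=1)
    total = pos + neg

    pos, neg, and total are length-10,000 arrays; wrapping them with pd.DataFrame({'positive': pos, 'negative': neg, 'total': total}) reproduces the same summary table.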