I am writing a simulation that creates 10,000 periods of 25 sets, with each set consisting of 48 coin tosses. Something in this code is making it run very slowly. It has been running for at least 20 minutes and it is still working. A similar simulation in R runs in under 10 seconds.
Here is the Python code I am using:
import pandas as pd
from random import choices

threshold = 17
all_periods = pd.DataFrame()
for i in range(10000):
    simulated_period = pd.DataFrame()
    for j in range(25):
        # Data frame with 48 weeks as rows. Each run through loop adds one more year as column until there are 25
        simulated_period = pd.concat([simulated_period, pd.DataFrame(choices([1, -1], k=48))],
                                     ignore_index=True, axis=1)
    positives = simulated_period[simulated_period==1].count(axis=1)
    negatives = simulated_period[simulated_period==-1].count(axis=1)
    # Combine positives and negatives that are more than the threshold into single dataframe
    sig = pd.DataFrame([[sum(positives>=threshold), sum(negatives>=threshold)]], columns=['positive', 'negative'])
    sig['total'] = sig['positive'] + sig['negative']
    # Add summary of individual simulation to the others
    all_periods = pd.concat([all_periods, sig])
If it helps, here is the R script that is running quickly:
flip <- function(threshold=17){
    #threshold is min number of persistent results we want to see. For example, 17/25 positive or 17/25 negative
    outcomes <- c(1, -1)
    trial <- do.call(cbind, lapply(1:25, function (i) sample(outcomes, 48, replace=T)))
    trial <- as.data.frame(t(trial)) #48 weeks in columns, 25 years in rows.
    summary <- sapply(trial, function(x) c(pos=length(x[x==1]), neg=length(x[x==-1])))
    summary <- as.data.frame(t(summary)) #use data frame so $pos/$neg can be used instead of [1,]/[2,]
    sig.pos <- length(summary$pos[summary$pos>=threshold])
    sig.neg <- length(summary$neg[summary$neg>=threshold])
    significant <- c(pos=sig.pos, neg=sig.neg, total=sig.pos+sig.neg)
    return(significant)
}
results <- do.call(rbind, lapply(1:10000, function(i) flip(threshold)))
results <- as.data.frame(results)
Can anyone tell me what in my Python code is slowing the process down? Thank you.
Why don't you generate the whole data set in one go, instead of building it up piece by piece in a loop?
import numpy as np
import pandas as pd

# One row per (period, set) pair, one column per coin toss
idx = pd.MultiIndex.from_product((range(10000), range(25)),
                                 names=('period', 'set'))
df = pd.DataFrame(data=np.random.choice([1, -1], (10000*25, 48)), index=idx)
That took about 120 ms on my computer. And then the other operations:
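# groupby(level=0).sum() replaces sum(level=0), which pandas 2.0 removed; ge(17) matches the question's >= threshold
positives = df.eq(1).groupby(level=0).sum().ge(17).sum(axis=1).to_frame(name='positives')
negatives = df.eq(-1).groupby(level=0).sum().ge(17).sum(axis=1).to_frame(name='negatives')
all_periods = pd.concat((positives, negatives), axis=1)
all_periods['total'] = all_periods.sum(axis=1)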
take about 600 ms more.
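As for what is slow in the original: my guess is that most of the time goes into building and concatenating 250,000 tiny DataFrames, since every pd.concat call copies what has been accumulated so far. If you don't need the MultiIndex at all, the same summary can be sketched with a single 3-D NumPy array. This is only a rough sketch, not a benchmarked solution: the function name simulate and its parameters are names I made up for illustration, and I encode +1 as 1 and -1 as 0 to keep the array small.

```python
import numpy as np
import pandas as pd


def simulate(n_periods=10_000, n_sets=25, n_tosses=48, threshold=17, seed=None):
    """Summarise the coin-toss simulation without Python-level loops.

    Hypothetical helper for illustration; the names are not from the question.
    """
    rng = np.random.default_rng(seed)
    # Shape (period, set, toss): 1 stands for the +1 outcome, 0 for the -1 outcome.
    heads = rng.integers(0, 2, size=(n_periods, n_sets, n_tosses), dtype=np.int8)
    # For each period and toss position, count the +1 outcomes across the sets.
    pos_counts = heads.sum(axis=1, dtype=np.int64)
    # Every toss is either +1 or -1, so the -1 count is the remainder.
    neg_counts = n_sets - pos_counts
    # Per period, count the toss positions whose count reaches the threshold.
    summary = pd.DataFrame({
        "positive": (pos_counts >= threshold).sum(axis=1),
        "negative": (neg_counts >= threshold).sum(axis=1),
    })
    summary["total"] = summary["positive"] + summary["negative"]
    return summary


all_periods = simulate(seed=0)
print(all_periods.head())
```

The result has one row per period with the same positive, negative and total columns as the loop in the question. Passing a seed just makes the run reproducible; leave it as None for a fresh draw each time.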