Tags: r, dplyr, sum, simulation, statistics-bootstrap

How do I get the number of heads per trial, rather than the overall total, from a bootstrap simulation in R?


I have the following data object:

    require(tidyverse)
    sample(x = 0:1, size = 4, replace = TRUE) %>% sum()

I have created a bootstrap simulation of this code by using the replicate function (we were simulating coin tosses and heads50 is the final data object):

    heads50 <- replicate(50, sample(0:1, 4, TRUE)) %>% sum()

However, when I run the sum function it gives me the total number of heads across all 50 replications of the experiment, not the result of each trial. What I want is the number of heads in each trial of 4 tosses, not just the overall number, so that I can plot the probabilities later on.

I've also created a data object that groups the trials by outcome (i.e. to calculate the probability of tossing 1, 2, 3 or 4 heads out of four in a trial), like so:

    data50 <- tibble(heads = heads50) %>%
      group_by(heads) %>%
      summarise(n = n(), p = n / 50)

The problem is that this is not what I get when I try to generate a histogram; it just gives me a single bar with one overall probability:

    ggplot(data50, aes(x = heads, y = p)) +
      geom_bar(stat = "identity", fill = "green") +
      labs(x = "Number of Heads", y = "Probability of Heads in 4 flips (p)") +
      theme_minimal()

Anyone have an idea of how to sum each trial and separate out the possibilities? I have tried to restart rstudio and reload the tidyverse package, which includes dplyr with the 6 core functions.


Solution

  • The fundamental problem here is where you're calling the sum() function. When sum() is outside replicate(), replicate() returns a 4x50 matrix of zeros and ones (one column per trial), and sum() then adds up every element of that matrix into a single number. What you want instead is a sum taken on a per-trial basis, so the addition has to happen inside the replicated expression, not outside it. Try:

    heads50 <- replicate(50, sample(0:1, size = 4, replace = TRUE) %>% sum())
    

    Another option would be to sum your matrix only along columns; that is,

    heads50 <- replicate(50, sample(0:1, size = 4, replace = TRUE)) %>% colSums()
    

    where this time the colSums() function sits outside the replicate() as it did in your original example.
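
To see the difference between the two sums, it can help to look at the shape of what replicate() returns. The following is only a minimal sketch in base R (so it runs without the tidyverse); the seed, the intermediate name `flips`, and the use of table() with a factor in place of the group_by()/summarise() step are my own assumptions for illustration, not part of the question's code.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# replicate() returns one column per trial: 4 flips x 50 trials
flips <- replicate(50, sample(0:1, size = 4, replace = TRUE))
dim(flips)   # 4 rows (flips), 50 columns (trials)

sum(flips)                 # a single number: heads over all 200 flips
heads50 <- colSums(flips)  # 50 numbers: heads in each trial
length(heads50)

# Tabulate over every possible outcome (0 to 4 heads), so that outcomes
# which never occurred still appear, with a count of 0
counts <- table(factor(heads50, levels = 0:4))
data50 <- data.frame(
  heads = 0:4,
  n     = as.vector(counts),
  p     = as.vector(counts) / length(heads50)
)
data50
```

With heads50 holding one value per trial, the group_by()/summarise() step from the question should likewise produce one row per outcome that occurred, and the plot one bar per row.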