R: Select a numeric vector from a data frame, draw n = 10 random subsets of size 5 and of size 15 from it, and calculate the mean of each sample


I have the following problem:

  1. I have a data frame with 20 rows and 2 columns: "Name" (text) and "Values" (numeric).
  2. I want to extract "Values", randomly sample (with equal weight) a subset of size 5 from it 10 times, and calculate the mean of each subset. I want to store those 10 means in another vector (10x1).
  3. I want to do the same as in step 2, but with a larger subset of 15 (of the 20) values: draw 15 values, calculate the mean, repeat this 10 times, and store the results in a new vector (10x1).
  4. Ultimately, I want to compare some descriptive statistics between these two vectors, e.g. I expect the vector from the smaller subset size to have fatter tails, be more negatively skewed, etc.

Creating the data frame as a start

Name <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t")
Values <- c(0.1, 0.05, 0.03, 0.06, -0.1, -0.3, -0.05, 0.5, 0.12, 0.06, 0.04, 0.15, 0.13, 0.16, -0.12, -0.03, -0.5, 0.05, 0.07, 0.03)
data <- data.frame(Name, Values)

The relevant part:

# extract Values column
Values <- data$Values

# define sizes of subset and number of iterations
n_small <- 5
n_large <- 15
n_iterations <- 10

set.seed(123456)

# Initialize result vector
Averages_small <- NULL
Averages_large <- NULL

# Calculate average of the subset and allocate it to the result vector
for (i in n_iterations) {
  Averages_small[i] <- mean(sample(Values, n_small, replace = FALSE))
  Averages_large[i] <- mean(sample(Values, n_large, replace = FALSE))
}

Somehow this gives me nine NAs and one number. What am I doing wrong? And is there a better way to do this than a for loop? The data above are only an example without NA values; the original data set has 20k rows and might contain missing values.

For background: the Values are investment returns, and the question is whether holding a larger number of investments helps diversification.

Thank you very much for your help!


Solution
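
As for why the loop in the question returns nine NAs: `for (i in n_iterations)` iterates over the single value 10 rather than over 1 to 10, so only element 10 is ever assigned and R fills elements 1 to 9 with NA. A minimal sketch of a corrected loop, assuming the same `Values` and settings as in the question, and using `seq_len()` with preallocated result vectors:

```r
Values <- c(0.1, 0.05, 0.03, 0.06, -0.1, -0.3, -0.05, 0.5, 0.12, 0.06,
            0.04, 0.15, 0.13, 0.16, -0.12, -0.03, -0.5, 0.05, 0.07, 0.03)

n_small <- 5
n_large <- 15
n_iterations <- 10

set.seed(123456)

# Preallocate the result vectors instead of growing them from NULL
Averages_small <- numeric(n_iterations)
Averages_large <- numeric(n_iterations)

# seq_len(n_iterations) yields 1, 2, ..., 10; a bare n_iterations is just 10
for (i in seq_len(n_iterations)) {
  Averages_small[i] <- mean(sample(Values, n_small, replace = FALSE))
  Averages_large[i] <- mean(sample(Values, n_large, replace = FALSE))
}

Averages_small
Averages_large
```

Both vectors then hold 10 means each. The loop works, but it can be written more compactly: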

  • You can use replicate to get 10 draws of your sample. This returns a matrix with the samples in columns, so the colMeans of this matrix gives you the vector you are looking for:

    set.seed(1) # For reproducibility
    
    vec5  <- colMeans(replicate(10, sample(data$Values, 5)))
    vec15 <- colMeans(replicate(10, sample(data$Values, 15)))
    
    vec5
    #> [1] -0.014  0.148  0.044 -0.026  0.062  0.020 -0.032 -0.130  0.166  0.040
    
    vec15
    #> [1]  0.058000000  0.024666667  0.051333333  0.045333333  0.024000000
    #> [6]  0.010666667  0.022666667 -0.010000000  0.003333333 -0.001333333
    

    You can see that the standard deviation of vec5 is indeed larger:

    sd(vec5)
    #> [1] 0.08711908
    
    sd(vec15)
    #> [1] 0.02297406
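
The question also mentions possible missing values and a comparison of tail behaviour and skewness. The following is only a sketch under a few assumptions: NAs are dropped before sampling (so every mean is based on exactly `size` values), one NA is added to the example data purely for illustration, 1000 draws are used instead of 10 so that the shape statistics are less noisy, and `sample_means()`, `skewness()`, `excess_kurtosis()` and `shape_stats()` are hypothetical helper names written in base R rather than functions from a package.

```r
# Example data from the question, with the last value replaced by NA
# purely for illustration
Values <- c(0.1, 0.05, 0.03, 0.06, -0.1, -0.3, -0.05, 0.5, 0.12, 0.06,
            0.04, 0.15, 0.13, 0.16, -0.12, -0.03, -0.5, 0.05, 0.07, NA)

# Hypothetical helper: means of n_draws random subsets of a given size,
# sampled without replacement from the non-missing values
sample_means <- function(x, size, n_draws) {
  x <- x[!is.na(x)]
  replicate(n_draws, mean(sample(x, size)))
}

# Hypothetical helpers: moment-based skewness and excess kurtosis
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}
excess_kurtosis <- function(x) {
  z <- x - mean(x)
  mean(z^4) / mean(z^2)^2 - 3
}

# Hypothetical helper: collect the statistics of interest in one named vector
shape_stats <- function(x) {
  c(sd = sd(x), skewness = skewness(x), excess_kurtosis = excess_kurtosis(x))
}

set.seed(1)
means5  <- sample_means(Values, size = 5,  n_draws = 1000)
means15 <- sample_means(Values, size = 15, n_draws = 1000)

# One row per subset size
rbind(size_5 = shape_stats(means5), size_15 = shape_stats(means15))
```

An alternative would be to sample first and call `mean(..., na.rm = TRUE)` afterwards, but then the means would be based on differing numbers of values, which makes the two subset sizes harder to compare.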