Search code examples
rlogicsampletapplyrep

creating a dataframe of means of 5 randomly sampled observations


I'm currently reading "Practical Statistics for Data Scientists" and following along in R as they demonstrate some code. There is one chunk of code I'm particularly struggling to follow the logic of and was hoping someone could help. The code in question is creating a dataframe with 1000 rows where each observation is the mean of 5 randomly drawn income values from the dataframe loans_income. However, I'm getting confused about the logic of the code as it is fairly complicated with a tapply() function and nested rep() statements.

The code to create the dataframe in question is as follows:

samp_mean_5 <- data.frame(income = tapply(sample(loans_income$income,1000*5),
                                          rep(1:1000,rep(5,1000)),
                                          FUN = mean),
           type='mean_of_5')

In particular, I'm confused about the nested rep() statements and the 1000*5 portion of the sample() function. Any help understanding the logic of the code would be greatly appreciated!

For reference, the original dataset loans_income simply has a single column of 50,000 income values.


Solution

  • You have 50,000 loans_income in a single vector. Let's break your code down:

    tapply(sample(loans_income$income,1000*5),
           rep(1:1000,rep(5,1000)),
           FUN = mean)
    

    I will replace 1000 with 10 and income with random numbers, so it's easier to explain. I also set set.seed(1) so the result can be reproduced.

    1. sample(loans_income$income,1000*5)
      We 50 random incomes from your vector without replacement. They are (temporarily) put into a vector of length 50, so the output looks like this:
    > sample(runif(50000),10*5)
     [1] 0.73283101 0.60329970 0.29871173 0.12637654 0.48434952 0.01058067 0.32337850
     [8] 0.46873561 0.72334215 0.88515494 0.44036341 0.81386225 0.38118213 0.80978822
    [15] 0.38291273 0.79795343 0.23622492 0.21318431 0.59325586 0.78340477 0.25623138
    [22] 0.64621658 0.80041393 0.68511759 0.21880083 0.77455662 0.05307712 0.60320912
    [29] 0.13191926 0.20816298 0.71600799 0.70328349 0.44408218 0.32696205 0.67845445
    [36] 0.64438336 0.13241312 0.86589561 0.01109727 0.52627095 0.39207860 0.54643661
    [43] 0.57137320 0.52743012 0.96631114 0.47151170 0.84099503 0.16511902 0.07546454
    [50] 0.85970500
    
    1. rep(1:1000,rep(5,1000))
      Now we are creating an indexing vector of length 50:
    > rep(1:10,rep(5,10))
    [1]  1  1  1  1  1  2  2  2  2  2  3  3  3  3  3  4  4  4  4  4  5  5  5  5  5  6  6  6
    [29]  6  6  7  7  7  7  7  8  8  8  8  8  9  9  9  9  9 10 10 10 10 10
    

    Those indices "group" the samples from step 1. So basically this vector tells R that the first 5 entries of your "sample vector" belong together (index 1), the next 5 entries belong together (index 2) and so on.

    1. FUN = mean
      Just apply the mean-function on the data.

    2. tapply
      So tapply takes the sampled data (sample-part) and groups them by the second argument (the rep()-part) and applies the mean-function on each group.

    If you are familiar with data.frames and the dplyr package, take a look at this (only the first 10 rows are displayed):

    set.seed(1)
    df <- data.frame(income=sample(runif(5000),10*5), index=rep(1:10,rep(5,10)))
           income index
    1  0.42585569     1
    2  0.16931091     1
    3  0.48127444     1
    4  0.68357403     1
    5  0.99374923     1
    6  0.53227877     2
    7  0.07109499     2
    8  0.20754511     2
    9  0.35839481     2
    10 0.95615917     2
    

    I attached the an index to the random numbers (your income). Now we calculate the mean per group:

    df %>% 
      group_by(index) %>%
      summarise(mean=mean(income))
    

    which gives us

    # A tibble: 10 x 2
       index  mean
       <int> <dbl>
     1     1 0.551
     2     2 0.425
     3     3 0.827
     4     4 0.391
     5     5 0.590
     6     6 0.373
     7     7 0.514
     8     8 0.451
     9     9 0.566
    10    10 0.435
    

    Compare it to

    set.seed(1)
    tapply(sample(runif(5000),10*5),
           rep(1:10,rep(5,10)),
           mean)
    

    which yields basically the same result:

            1         2         3         4         5         6         7         8         9 
    0.5507529 0.4250946 0.8273149 0.3905850 0.5902823 0.3730092 0.5143829 0.4512932 0.5658460 
           10 
    0.4352546