r, statistics, bootstrap

Bootstrapping in R (with replacement) whilst retaining dependency for two variables


Background and context

I am new to R, but I have a basic understanding of how to run a bootstrap procedure for a single variable. However, the online guides I have looked at only use a single variable in their examples, and their outcome is a histogram showing the frequency of the means generated across all the resamples.

I am looking to bootstrap a sample in which each observation consists of two variables (participant age and test score). I understand how to bootstrap each variable independently, i.e. age or score on its own, but because participants of the same age sometimes get different scores, I am not sure how to determine which score corresponds to a bootstrapped age.

For example, a 20-year-old participant has a score of 50, and a second 20-year-old has a score of 70, and these are within my data. If I were to run a bootstrap with replacement based on age, it is possible that one of the 20-year-olds will be selected and replaced back into the dataset. However, I do not know what their corresponding score would be - i.e., I do not know whether the one who scored 50 or the one who scored 70 was selected.

Others I have asked suggested that I might need to sample age and score together, as a single row, to retain the dependency between the two. My data in R has one row per participant, with age in one column and score in another.

What am I looking for?

The end goal of the bootstrapping is to resample my data (with replacement) 200 times, giving me 200 "different" data sets. I can then fit a quadratic function to each one and determine the vertex of the curve. The 200 vertex values will be combined to give a mean and a standard error.

Having little experience with R coding, I have not tried a great deal other than understanding the basics of bootstrapping (with replacement).

I am aware that it is possible to mutate/merge data, but I do not believe it fits with this. I am not sure of how to proceed, and any support (sources of information or where I can look etc.) would be greatly appreciated.


Solution

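  • You could run the resampling on the row indices rather than on the individual columns. Each sampled index pulls out a whole row, so every age stays paired with its own score.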

    For example:

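    set.seed(1)  # make the example reproducible
    # Toy data: two participants per age, each with their own score
    df <- data.frame(age = rep(seq(20, 50, 10), each = 2), score = sample(50:70, 8))
    df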
    
      age score
    1  20    68
    2  20    62
    3  30    53
    4  30    67
    5  40    66
    6  40    55
    7  50    51
    8  50    65
    

    Resample whole rows with replacement (row names such as 1.1 and 1.2 mark repeated draws of row 1):

    df[sample(seq_len(nrow(df)), nrow(df), replace = TRUE), ]
    
        age score
    6    40    55
    4    30    67
    1    20    68
    7    50    51
    1.1  20    68
    5    40    66
    1.2  20    68
    3    30    53
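
To connect the row resampling above to the stated goal (200 resamples, a quadratic fit to each, then a mean and standard error of the vertex), here is a minimal sketch in base R. It is only one way to do this, and it rests on assumptions that are not in the question: the data are simulated purely for illustration, the quadratic is fitted with `lm()`, and the names `dat`, `vertex_age`, `n_boot` and `boot_vertices` are made up for this example.

```r
# Simulated data for illustration only (an assumption, not the asker's data):
# scores follow a downward-opening quadratic in age, plus noise.
set.seed(42)
n <- 100
dat <- data.frame(age = sample(20:60, n, replace = TRUE))
dat$score <- 80 - 0.05 * (dat$age - 40)^2 + rnorm(n, sd = 3)

# Hypothetical helper: fit score = b0 + b1*age + b2*age^2 and return the
# age at the vertex, which is -b1 / (2 * b2).
vertex_age <- function(d) {
  fit <- lm(score ~ age + I(age^2), data = d)
  b <- coef(fit)
  unname(-b[2] / (2 * b[3]))
}

n_boot <- 200  # number of bootstrap resamples, as in the question

boot_vertices <- replicate(n_boot, {
  # Resample row indices so each age keeps its own score
  idx <- sample(seq_len(nrow(dat)), nrow(dat), replace = TRUE)
  vertex_age(dat[idx, ])
})

boot_mean <- mean(boot_vertices)  # mean of the 200 vertex estimates
boot_se   <- sd(boot_vertices)    # bootstrap standard error: SD of the estimates
```

With real data, replace `dat` with your own data frame. `hist(boot_vertices)` would give the same kind of histogram the single-variable guides produce. If a resample happens to contain fewer than three distinct ages, the quadratic term cannot be estimated and the vertex comes out as `NA`, so it may be worth checking `anyNA(boot_vertices)` on small data sets.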