Tags: r, loops, nlme, subsampling

Is there a way to create a loop where I provide a function and a data frame, subsample the data frame, and repeat the function on a new subsample N times?


I am not sure what the correct term for this is, so apologies if the terminology is wrong. Basically, I have about 1000 data points, and I want to randomly subsample 100 of them 999 times, fit the same generalised least squares model to each subsample, and see how often the correlation comes out significant.

I am also adding some more context in case it helps. My data are in a data frame with various columns, and I am testing whether there is a relationship between altitude and dichromatism, and whether that relationship varies depending on whether dichromatism is measured with a spectrophotometer or by human scoring. I also include the latitude centroid of each species' range in these models, so the PGLS for each looks as follows:

library(nlme)  # gls()
library(ape)   # corPagel()

# Spectrophotometer (VO) scores
PGLS_VO_Score <- gls(Colour_discriminability_Absolute ~ Altitude_Reported*Centroid.Abs, 
                     correlation = corPagel(1, phy = AvianTreeEdge, form = ~Species), 
                     data = VO_HumanScores_Merged, method = "ML")

# Human scores
PGLS_Human_Score <- gls(Human_Score ~ Altitude_Reported*Centroid.Abs, 
                        correlation = corPagel(1, phy = AvianTreeEdge, form = ~Species), 
                        data = VO_HumanScores_Merged, method = "ML")

The data frame VO_HumanScores_Merged includes a column for species names, human scores, spectrophotometer scores, altitude, and latitude, plus some transformed versions of those (log-transformed, etc.), which I created at the start in case I needed to transform the data to meet the assumptions of the PGLS.


Solution

  • A small sampling pipeline helps show what can be done here:

    ## contrived example statistic: correlation between two columns
    myfun <- function(x) cor(x[[1]], x[[3]])
    set.seed(42)
    ## draw 5 random subsamples of 10 rows each, then apply myfun to each
    replicate(5, mtcars[sample(nrow(mtcars), 10),], simplify=FALSE) |>
      lapply(myfun)
    # [[1]]
    # [1] -0.8130999
    # [[2]]
    # [1] -0.8633841
    # [[3]]
    # [1] -0.7967049
    # [[4]]
    # [1] -0.901294
    # [[5]]
    # [1] -0.8761853
    

    (My 5 is your 999, my 10 is your 100.)

    The simplify=FALSE is required since otherwise replicate will try to simplify the result into a (nested) matrix, which is not what we want. My myfun is contrived; use whatever function you want (for your PGLS case, see the sketch just below).
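
    In your case, that function could refit one of your gls() models on the subsample and return the p-value of the term you care about. Below is a minimal sketch, not a drop-in solution: it assumes the objects from your question (VO_HumanScores_Merged, AvianTreeEdge), that the Altitude_Reported coefficient is what you want to track, and that the tree needs pruning to the sampled species (it may not, depending on how your tree and data line up); fit_one is just a hypothetical name.

    library(nlme)   # gls()
    library(ape)    # corPagel(), drop.tip()

    ## hypothetical wrapper: fit the PGLS on one subsample and return the
    ## p-value of the altitude term from the coefficient table
    fit_one <- function(d) {
      ## prune the tree to the sampled species (may be unnecessary,
      ## depending on how your tree and data are matched)
      tree_sub <- drop.tip(AvianTreeEdge,
                           setdiff(AvianTreeEdge$tip.label, d$Species))
      fit <- gls(Colour_discriminability_Absolute ~ Altitude_Reported*Centroid.Abs,
                 correlation = corPagel(1, phy = tree_sub, form = ~Species),
                 data = d, method = "ML")
      summary(fit)$tTable["Altitude_Reported", "p-value"]
    }

    set.seed(42)
    pvals <- replicate(999,
                       VO_HumanScores_Merged[sample(nrow(VO_HumanScores_Merged), 100), ],
                       simplify = FALSE) |>
      sapply(fit_one)
    mean(pvals < 0.05)   # how often the altitude effect comes out significant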

    The (perhaps only) advantage of breaking it into two (or more) steps instead of one pipeline is that, if you later want to revisit the random samples, it is much simpler when you have saved them. For example,

    set.seed(42)
    sampdat <- replicate(5, mtcars[sample(nrow(mtcars), 10),], simplify=FALSE)
    lapply(sampdat, myfun)
    # [[1]]
    # [1] -0.8130999
    # [[2]]
    # [1] -0.8633841
    # [[3]]
    # [1] -0.7967049
    # [[4]]
    # [1] -0.901294
    # [[5]]
    # [1] -0.8761853
    

    If you later realise you need to do something else with the sampled data (another metric, another model) and you don't want, for time, memory, or convenience reasons, to rerun all of the other sample aggregations, you can reuse sampdat.
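
    In your setting, that could mean drawing the 999 subsamples once, saving them, and then running both PGLS versions (spectrophotometer and human scoring) over exactly the same subsamples so the two are directly comparable. A hedged sketch, reusing the hypothetical fit_one() from the earlier sketch plus an analogous (equally hypothetical) fit_human():

    ## hypothetical analogue of fit_one() for the human-scoring response
    fit_human <- function(d) {
      tree_sub <- drop.tip(AvianTreeEdge,
                           setdiff(AvianTreeEdge$tip.label, d$Species))
      fit <- gls(Human_Score ~ Altitude_Reported*Centroid.Abs,
                 correlation = corPagel(1, phy = tree_sub, form = ~Species),
                 data = d, method = "ML")
      summary(fit)$tTable["Altitude_Reported", "p-value"]
    }

    set.seed(42)
    subsamps <- replicate(999,
                          VO_HumanScores_Merged[sample(nrow(VO_HumanScores_Merged), 100), ],
                          simplify = FALSE)

    pvals_vo    <- sapply(subsamps, fit_one)    # spectrophotometer model
    pvals_human <- sapply(subsamps, fit_human)  # human-score model, same subsamples

    mean(pvals_vo < 0.05)      # proportion of subsamples significant for each model
    mean(pvals_human < 0.05)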