
How to subset rows in a dataframe in R according to mean of variable?


I have a data frame in R with 120 observations (participants). The mean age of the whole sample is 51 years (range 25-90). I would like to randomly select 60 of these observations so that the subset has a mean age of 40. Is there a way of doing this? I would prefer not to trim manually, to avoid the issues that could come from that.

I appreciate any help that can be provided!


Solution

  • If you are constraining your sample to have a particular mean then it isn't a truly random sample. However, there are various ways to do this, none of which are easy. It depends on the distribution of ages in your sample, which of course I don't have.

    Anyway, the following data frame will be somewhat similar to yours:

    set.seed(772)
    df <- data.frame(age = sample(25:90, 120, replace = TRUE), ID = factor(1:120))
    

    We can see it has ages with the right range and about the right mean:

    range(df$age)
    #> [1] 25 90
    mean(df$age)
    #> [1] 51.23333
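
    Before sampling, it may be worth checking that a mean of 40 is reachable at all. The lowest mean any 60-row subset can have is the mean of the 60 youngest ages, so if that already exceeds 40, no amount of resampling will help. A minimal sketch on the simulated data (the names min_mean, max_mean and feasible are only illustrative):

    ```r
    # Recreate the simulated data so this snippet runs on its own
    set.seed(772)
    df <- data.frame(age = sample(25:90, 120, replace = TRUE), ID = factor(1:120))

    target <- 40

    # Lowest possible mean of 60 rows: take the 60 youngest participants
    min_mean <- mean(sort(df$age)[1:60])

    # Highest possible mean of 60 rows: take the 60 oldest participants
    max_mean <- mean(sort(df$age, decreasing = TRUE)[1:60])

    # A subset mean near the target can only exist inside this interval
    feasible <- min_mean <= target && target <= max_mean
    feasible
    ```

    If feasible is FALSE for your real data, the target mean or the subset size has to change.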
    

    Now to get your sample ages to average 40, you will need to sample preferentially from the younger group. First we'll find the indices of the "old" and "young" participants:

    young <- which(df$age <= 40)
    old   <- which(df$age > 40)
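
    As a quick sanity check on these two groups, and assuming the plan is to draw 40 younger and 20 older participants (the split used in the next step), you can count the group sizes and work out the mean age such a sample would have on average. This is only a sketch; enough and expected_mean are illustrative names:

    ```r
    # Recreate the simulated data and the two groups so this runs on its own
    set.seed(772)
    df <- data.frame(age = sample(25:90, 120, replace = TRUE), ID = factor(1:120))
    young <- which(df$age <= 40)
    old   <- which(df$age > 40)

    # Sampling without replacement needs at least 40 young and 20 old rows
    enough <- length(young) >= 40 && length(old) >= 20

    # Average mean age of a 40 + 20 sample: the weighted mean of the group means
    expected_mean <- (40 * mean(df$age[young]) + 20 * mean(df$age[old])) / 60

    c(n_young = length(young), n_old = length(old))
    expected_mean
    ```

    The further expected_mean is from 40, the more draws it will typically take to find a sample close enough to the target; changing the ratio or the age cut-off moves it.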
    

    Now we just need to try lots of samples (via a loop) until the mean is close to 40. To do this without completely truncating the older ages, we will take a 2:1 ratio of young to old participants in each sample: 40 young and 20 old. For this to work, you'll need at least 40 participants aged 40 or under (and at least 20 over 40) in your data, which I'm guessing you do have.

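    seed <- 1
    
    # Keep drawing until a sample's mean age is within 0.25 of 40.
    # Note: this never terminates if no such sample exists.
    while(TRUE)
    {
      set.seed(seed)  # makes the accepted sample reproducible
      young_indices <- young[sample(length(young), 40)]  # 40 aged <= 40
      old_indices   <- old[sample(length(old), 20)]      # 20 aged > 40
      indices       <- c(young_indices, old_indices)
    
      if(abs(mean(df$age[indices]) - 40) < 0.25) break
    
      seed <- seed + 1  # otherwise try the next seed
    }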
    
    sample_df <- df[indices, ]
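
    The loop above keeps going forever if no sample ever lands close enough to the target. One way to guard against that is to wrap the same idea in a function with a cap on the number of attempts. This is a sketch, not a polished implementation: the function name sample_target_mean and all of its arguments are made up for illustration, and it draws repeatedly from a single random stream instead of re-seeding on every attempt.

    ```r
    # Hypothetical helper: returns row indices of a sample whose mean age is
    # within `tol` of `target`, or NULL if none is found in `max_tries` draws.
    sample_target_mean <- function(age, n_young = 40, n_old = 20, target = 40,
                                   tol = 0.25, cutoff = 40, max_tries = 10000) {
      young <- which(age <= cutoff)
      old   <- which(age > cutoff)

      # Sampling without replacement needs enough rows in each group
      if (length(young) < n_young || length(old) < n_old) {
        warning("not enough participants in one of the age groups")
        return(NULL)
      }

      for (i in seq_len(max_tries)) {
        idx <- c(young[sample.int(length(young), n_young)],
                 old[sample.int(length(old), n_old)])
        if (abs(mean(age[idx]) - target) < tol) return(idx)
      }

      NULL  # gave up: no qualifying sample within max_tries
    }

    # Usage on the simulated data from this answer
    set.seed(772)
    df <- data.frame(age = sample(25:90, 120, replace = TRUE), ID = factor(1:120))

    set.seed(1)
    idx <- sample_target_mean(df$age)
    if (!is.null(idx)) {
      capped_sample_df <- df[idx, ]
    }
    ```

    A NULL result means either a group is too small or the tolerance is too tight for the chosen ratio, so it is a cue to adjust n_young, n_old, cutoff or tol.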
    

    Now sample_df will contain 60 unique participants whose average age is about 40:

    nrow(sample_df)
    #> [1] 60
    mean(sample_df$age)
    #> [1] 40.1