
Selecting a sample to match the distribution of variables in another dataset


Let x be a dataset with 5 variables and 15 observations:

age gender  height  weight  fitness
17  M   5.34    68  medium
23  F   5.58    55  medium
25  M   5.96    64  high
25  M   5.25    60  medium
18  M   5.57    60  low
17  F   5.74    61  low
17  M   5.96    71  medium
22  F   5.56    75  high
16  F   5.02    56  medium
21  F   5.18    63  low
20  M   5.24    57  medium
15  F   5.47    72  medium
16  M   5.47    61  high
22  F   5.88    73  low
18  F   5.73    62  medium

The frequencies of the values for the fitness variable are as follows: low = 4, medium = 8, high = 3.

Suppose I have another dataset y with the same 5 variables but 100 observations. The frequencies of the values for the fitness variable in this dataset are as follows: low = 42, medium = 45, high = 13.

Using R, how can I obtain a representative sample from y such that the sample fitness closely matches the distribution of the fitness in x?

My initial idea was to use the sample function in R and pass weighted probabilities to the prob argument. However, using probabilities would force an exact match for the frequency distribution. My objective is to get a close enough match while maximizing the sample size.
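To make the trade-off between match quality and sample size concrete, here is a minimal base-R sketch that uses only the fitness counts stated above. It assumes an exact proportional match is wanted and finds the largest total size that y can supply; the names `x_counts`, `y_counts`, `n_max`, and `quota` are illustrative and not part of the question.

```r
# Fitness counts as stated in the question (x has 15 rows, y has 100 rows)
x_counts <- c(low = 4, medium = 8, high = 3)
y_counts <- c(low = 42, medium = 45, high = 13)

# Largest total n for which n * (x_counts / 15) does not exceed
# what y holds in any level (integer arithmetic avoids rounding surprises)
n_max <- floor(min(y_counts * sum(x_counts) / x_counts))

# Per-level quotas, rounded down so they never exceed availability
quota <- floor(n_max * x_counts / sum(x_counts))

print(n_max)
print(quota)
```

The level that is scarcest in y relative to x sets the ceiling. Because the quotas are rounded down, their sum can fall slightly below `n_max`; allowing some deviation from the target proportions would permit a larger sample.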

Additionally, how could I add a second constraint so that the distribution of gender in the sample also closely matches that of x?
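With two constraints, the target becomes the joint distribution of fitness and gender rather than two separate marginals. The sketch below tabulates that joint distribution for x in base R; the two vectors are copied from the table above in row order, and the name `joint` is illustrative.

```r
# Fitness and gender columns of x, copied from the table above (row order preserved)
fitness <- c("medium", "medium", "high", "medium", "low", "low", "medium", "high",
             "medium", "low", "medium", "medium", "high", "low", "medium")
gender  <- c("M", "F", "M", "M", "M", "F", "M", "F", "F", "F", "M", "F", "M", "F", "F")

# Joint counts for every fitness/gender combination
joint <- table(fitness = factor(fitness, levels = c("low", "medium", "high")),
               gender  = gender)
print(joint)

# Joint proportions: the distribution a sample from y would need to approximate
print(round(prop.table(joint), 3))
```

Each cell of this table is a separate stratum, so a sparse cell in y limits the achievable sample size more than either variable would on its own.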


Solution

  • Consider using rmultinom to generate the sample counts for each level of fitness.

    Prepare the data (I have used the preparation of y from @Edward's answer):

    x <- read.table(text = "age gender  height  weight  fitness
    17  M   5.34    68  medium
    23  F   5.58    55  medium
    25  M   5.96    64  high
    25  M   5.25    60  medium
    18  M   5.57    60  low
    17  F   5.74    61  low
    17  M   5.96    71  medium
    22  F   5.56    75  high
    16  F   5.02    56  medium
    21  F   5.18    63  low
    20  M   5.24    57  medium
    15  F   5.47    72  medium
    16  M   5.47    61  high
    22  F   5.88    73  low
    18  F   5.73    62  medium", header = TRUE)
    
    y <- data.frame(age=round(rnorm(100, 20, 5)), 
                     gender=factor(gl(2,50), labels=LETTERS[c(6, 13)]), 
                     height=round(rnorm(100, 12, 3)), 
                     fitness=factor(c(rep("low", 42), rep("medium", 45), rep("high", 13)), 
                                    levels=c("low","medium","high")))
    

    Now the sampling procedure. Update: I have changed the code to handle the two-variable case (gender and fitness).

    library(tidyverse)
    
    N_SAMPLES <- 100  # nominal sample size; capped below by what y actually contains
    
    # Calculate the relative frequency of each fitness/gender combination in x
    freq <- x %>%
        group_by(fitness, gender) %>% # You can set any combination of factors
        summarise(freq = n() / nrow(x), .groups = "drop")
    
    # Draw N_SAMPLES single multinomial trials with those probabilities
    distr <- rmultinom(N_SAMPLES, 1, freq$freq)
    # Convert to counts per combination
    freq$counts <- rowSums(distr)
    
    # Join y with the frequencies and counts for further use in sampling
    y_count <- y %>% left_join(freq, by = c("fitness", "gender"))
    
    # Perform sampling using the multinomial counts
    y_sampled <- y_count %>%
        group_by(fitness, gender) %>% # Should be the same as in the frequency calculation
        # Take at most as many rows as the group actually contains
        sample_n(size = min(n(), first(counts)), replace = FALSE) %>%
        ungroup() %>%
        select(-freq, -counts)
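One way to judge how close a sample comes to the target is to compare level proportions side by side. The following is a base-R sketch of the same idea, restricted to fitness and rebuilt from the counts in the question rather than from the data frames above. The nominal size of 60, the seed, and the names `target`, `avail`, `draw`, `take`, and `achieved` are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed, only so this sketch is reproducible

lev <- c("low", "medium", "high")
# Fitness columns reconstructed from the counts given in the question
x_fit <- factor(rep(lev, times = c(4, 8, 3)), levels = lev)
y_fit <- factor(rep(lev, times = c(42, 45, 13)), levels = lev)

target <- setNames(as.vector(prop.table(table(x_fit))), lev)  # proportions to imitate
avail  <- setNames(as.vector(table(y_fit)), lev)              # rows available in y

# Draw per-level counts for a nominal sample of 60 rows (assumed size)
draw <- rmultinom(1, size = 60, prob = target)[, 1]
# Never request more rows than y holds in a level
take <- pmin(draw, avail)

# Sample row indices of y without replacement, level by level
idx <- unlist(lapply(seq_along(lev), function(i) {
  rows <- which(y_fit == lev[i])
  rows[sample.int(length(rows), take[i])]
}))

# Compare target and achieved proportions
achieved <- as.vector(prop.table(table(y_fit[idx])))
print(round(rbind(target = target, achieved = achieved), 3))
```

Because the counts come from a random multinomial draw, the achieved proportions vary from run to run; capping a level at its availability in y shifts the result further from the target.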