Let x be a dataset with 5 variables and 15 observations:
age gender height weight fitness
17 M 5.34 68 medium
23 F 5.58 55 medium
25 M 5.96 64 high
25 M 5.25 60 medium
18 M 5.57 60 low
17 F 5.74 61 low
17 M 5.96 71 medium
22 F 5.56 75 high
16 F 5.02 56 medium
21 F 5.18 63 low
20 M 5.24 57 medium
15 F 5.47 72 medium
16 M 5.47 61 high
22 F 5.88 73 low
18 F 5.73 62 medium
The frequencies of the values for the fitness variable are as follows: low = 4, medium = 8, high = 3.
Suppose I have another dataset y with the same 5 variables but 100 observations. The frequencies of the values for the fitness variable in this dataset are as follows: low = 42, medium = 45, high = 13.
Using R, how can I obtain a representative sample from y such that the sample fitness closely matches the distribution of the fitness in x?
My initial idea was to use the sample function in R and pass weighted probabilities to the prob argument. However, using probabilities would force an exact match for the frequency distribution. My objective is to get a close enough match while maximizing the sample size.
Additionally, how could I add a second constraint so that the distribution of gender also closely matches that of x?
Consider using rmultinom to prepare the sample counts for each level of fitness.
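As a quick illustration of what rmultinom returns, here is a minimal base R sketch that draws group sizes from the fitness proportions in x. The sample size of 50, the seed, and the names p_x and counts are arbitrary choices for this example, not values from the question.

```r
# Fitness proportions observed in x (low = 4, medium = 8, high = 3 out of 15)
p_x <- c(low = 4, medium = 8, high = 3) / 15

set.seed(1)  # arbitrary seed, only so the illustration is reproducible

# One multinomial draw of 50 items; rmultinom returns a 3 x 1 matrix,
# so take the first column to get a plain vector of counts
counts <- rmultinom(n = 1, size = 50, prob = p_x)[, 1]

counts       # number of rows to take from each fitness level (sums to 50)
counts / 50  # realised proportions: close to p_x, but generally not identical
```

Because the counts are random, the sampled distribution follows x only approximately, which is the "close enough" behaviour asked for.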
Prepare the data (I have used the y preparation from @Edward's answer):
x <- read.table(text = "age gender height weight fitness
17 M 5.34 68 medium
23 F 5.58 55 medium
25 M 5.96 64 high
25 M 5.25 60 medium
18 M 5.57 60 low
17 F 5.74 61 low
17 M 5.96 71 medium
22 F 5.56 75 high
16 F 5.02 56 medium
21 F 5.18 63 low
20 M 5.24 57 medium
15 F 5.47 72 medium
16 M 5.47 61 high
22 F 5.88 73 low
18 F 5.73 62 medium", header = TRUE)
y <- data.frame(age = round(rnorm(100, 20, 5)),
                gender = factor(gl(2, 50), labels = LETTERS[c(6, 13)]),
                height = round(rnorm(100, 12, 3)),
                fitness = factor(c(rep("low", 42), rep("medium", 45), rep("high", 13)),
                                 levels = c("low", "medium", "high")))
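Before sampling, it can help to compare the two fitness distributions side by side. The sketch below rebuilds only the fitness columns from the counts stated in the question, so it runs on its own; fitness_x, fitness_y, prop_x and prop_y are names introduced here for illustration.

```r
lv <- c("low", "medium", "high")

# Fitness columns rebuilt from the counts given in the question
fitness_x <- factor(rep(lv, times = c(4, 8, 3)), levels = lv)
fitness_y <- factor(rep(lv, times = c(42, 45, 13)), levels = lv)

# Proportion of each level in each dataset
prop_x <- prop.table(table(fitness_x))
prop_y <- prop.table(table(fitness_y))

# One row per dataset, one column per fitness level
round(rbind(x = prop_x, y = prop_y), 3)
```

Relative to x, y has a larger share of low (0.42 vs. about 0.267) and smaller shares of medium and high, so the rows of y cannot all be kept if the sample is to follow x.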
Now the sampling procedure. Update: I have changed the code to handle the two-variable case (gender and fitness).
library(tidyverse)
N_SAMPLES <- 100  # target sample size; each group is capped by the rows available in y
# Calculate the joint frequencies of fitness and gender in x
freq <- x %>%
  group_by(fitness, gender) %>%  # you can use any combination of factors
  summarise(freq = n() / nrow(x), .groups = "drop")
# Draw N_SAMPLES single-trial multinomial vectors (one column per draw)
distr <- rmultinom(N_SAMPLES, 1, freq$freq)
# Sum across draws to get the number of rows to sample from each group
freq$counts <- rowSums(distr)
# Join y with the frequencies and target counts for use in sampling
y_count <- y %>%
  left_join(freq, by = c("fitness", "gender"))
# Perform the sampling using the multinomial counts
y_sampled <- y_count %>%
  group_by(fitness, gender) %>%  # must match the grouping used for freq
  # Take the target count, capped at the number of observations in the group
  sample_n(size = min(n(), first(counts)), replace = FALSE) %>%
  ungroup() %>%
  select(-freq, -counts)
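If a non-random split is preferred, one possible alternative to the multinomial draw (not part of the approach above) is to compute the largest sample size that y can supply while keeping the fitness proportions of x, then sample fixed quotas per level. This is only a sketch for the single-variable case: y_demo is a hypothetical stand-in for y, rebuilt from the counts in the question, and x_counts, n_y, n_max and quota are names introduced here.

```r
lv <- c("low", "medium", "high")

x_counts <- c(low = 4, medium = 8, high = 3)     # fitness counts in x
n_y      <- c(low = 42, medium = 45, high = 13)  # rows available in y

# Largest total N for which N * (x_counts / 15) never exceeds what y can supply
n_max <- floor(min(n_y * sum(x_counts) / x_counts))  # 65, limited by "high"

# Per-level quotas, rounded and capped at availability
quota <- pmin(round(n_max * x_counts / sum(x_counts)), n_y)  # 17, 35, 13

# Hypothetical stand-in for y: an id plus the fitness column only
y_demo <- data.frame(id = 1:100, fitness = rep(lv, times = n_y))

set.seed(42)  # arbitrary seed, only for reproducibility

# Sample the quota for each level without replacement
rows <- unlist(lapply(lv, function(l) {
  idx <- which(y_demo$fitness == l)
  idx[sample.int(length(idx), quota[[l]])]
}))
y_sub <- y_demo[rows, ]

table(y_sub$fitness)[lv]
```

With these counts the scarcest level in y (high, 13 rows) limits the sample to 65 rows. Matching gender as well would mean applying the same idea to the fitness-by-gender cells, where empty cells in y would shrink the sample further.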