I need to sample a data frame while keeping all levels of the factor variables in the result. I then want to get the complement of this sample, i.e., the rows that aren't part of the sample. My end goal is to create both a training and a test sample for regression analyses. To do that successfully, I need to ensure that all levels of the factor variables are represented in the training sample.
The approach I've tried (sample code below) was using dplyr::group_by combined with dplyr::slice_sample and then dplyr::anti_join to obtain the test sample. It's not working, for some reason. Either I'm missing something about how these functions are supposed to work or they're not behaving as expected.
I've also tried approaches based on this question. They didn't work because (1) I need to guarantee that all levels of multiple factors are represented and (2) I want to select a proportion of the observations, not a specific number.
> library(tidyverse)
>
> set.seed(72)
>
> data <- tibble(y = rnorm(100), x1 = rnorm(100),
+ x2 = sample(letters, 100, T), x3 = sample(LETTERS, 100, T))
> data
# A tibble: 100 x 4
y x1 x2 x3
<dbl> <dbl> <chr> <chr>
1 1.37 -0.737 c C
2 1.16 1.66 c T
3 0.0344 -0.319 q P
4 1.03 -0.963 k C
5 0.636 0.961 i H
6 0.319 0.761 g L
7 0.216 0.860 u M
8 1.31 0.887 g M
9 -0.594 2.70 m I
10 -0.542 0.517 u C
# … with 90 more rows
>
> train_data <- data %>%
+ group_by(x2, x3) %>%
+ slice_sample(prop = .7)
> train_data # clearly this is not what I want
# A tibble: 8 x 4
# Groups: x2, x3 [8]
y x1 x2 x3
<dbl> <dbl> <chr> <chr>
1 1.23 -0.297 c A
2 1.11 0.689 e O
3 0.559 0.353 e Z
4 -1.65 -1.71 l M
5 -0.777 1.31 l X
6 0.784 0.309 s E
7 0.755 -0.362 u X
8 -0.768 0.292 v H
>
> test_data <- data %>%
+ anti_join(train_data)
Joining, by = c("y", "x1", "x2", "x3")
> test_data # my goal was that the training data would have 70% and the test data would have around 30% of the full sample.
# A tibble: 92 x 4
y x1 x2 x3
<dbl> <dbl> <chr> <chr>
1 1.37 -0.737 c C
2 1.16 1.66 c T
3 0.0344 -0.319 q P
4 1.03 -0.963 k C
5 0.636 0.961 i H
6 0.319 0.761 g L
7 0.216 0.860 u M
8 1.31 0.887 g M
9 -0.594 2.70 m I
10 -0.542 0.517 u C
# … with 82 more rows
>
> reg <- lm(y ~ x1 + x2 + x3, train_data)
> predict(reg, newdata = test_data) # I obviously still have the same problem
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor x2 has new levels a, b, d, f, g, h, i, j, k, m, n, o, p, q, r, t, w, x, y, z
>
>
There is nothing wrong with your code/approach; you simply do not have enough observations. A lot of the (x2, x3) groups contain only 1 row, and sampling such a group with a proportion of 0.7 rounds 0.7 * 1 down to 0, so the group contributes nothing to the training sample. If you change the sample to 1000 rows, the same code works fine without error.
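To see the rounding effect on the 100-row example, here is a minimal sketch that counts the rows in each (x2, x3) group and how many of them a proportion of 0.7 would keep. It assumes slice_sample rounds prop * group size down, as described above; the column names group_size, kept and n_groups and the object name sizes are only illustrative.

```r
library(dplyr)

set.seed(72)
data <- tibble(y = rnorm(100), x1 = rnorm(100),
               x2 = sample(letters, 100, TRUE), x3 = sample(LETTERS, 100, TRUE))

# One row per (x2, x3) group with its size, then the number of rows that
# prop = 0.7 would keep if the product is rounded down (assumption).
sizes <- data %>%
  count(x2, x3, name = "group_size") %>%
  mutate(kept = floor(0.7 * group_size)) %>%
  count(group_size, kept, name = "n_groups")

sizes
```

Any row of sizes with kept equal to 0 is a group size that drops out of the training sample entirely.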
library(dplyr)
data <- tibble(y = rnorm(1000), x1 = rnorm(1000),
x2 = sample(letters, 1000, T), x3 = sample(LETTERS, 1000, T))
train_data <- data %>%
group_by(x2, x3) %>%
slice_sample(prop = 0.7)
test_data <- data %>% anti_join(train_data)
reg <- lm(y ~ x1 + x2 + x3, train_data)
predict(reg, newdata = test_data)
If your real data has groups with as few as 1 row, you can sample each group so that it selects the larger of 1 and (0.7 * number of rows in the group), which guarantees at least one row per group.
train_data <- data %>% group_by(x2, x3) %>% sample_n(max(0.7*n(), 1))
(I used sample_n here since I couldn't use n() in slice_sample.)
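Putting the pieces together, a possible end-to-end sketch of the split looks like this. The row_id column is a hypothetical helper that is not part of the original code; it lets anti_join match on a single key instead of all columns. The floor() call is also an addition, used so that the sample size is always a whole number.

```r
library(dplyr)

set.seed(72)
data <- tibble(y = rnorm(100), x1 = rnorm(100),
               x2 = sample(letters, 100, TRUE), x3 = sample(LETTERS, 100, TRUE)) %>%
  mutate(row_id = row_number())  # hypothetical helper key for the join

# Keep about 70% of each (x2, x3) group, but never fewer than 1 row
train_data <- data %>%
  group_by(x2, x3) %>%
  sample_n(max(floor(0.7 * n()), 1)) %>%
  ungroup()

# The test sample is every row that did not end up in the training sample
test_data <- anti_join(data, train_data, by = "row_id")

# Every level seen in the test sample is also present in the training sample
all(test_data$x2 %in% train_data$x2)
all(test_data$x3 %in% train_data$x3)
```

Note that single-row groups go entirely to the training sample, so when there are many of them the split will be well above 70/30 in favour of the training data.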