I am working in R with the following dataset of sentences taken from books. It contains the book ID, the colour of the book's cover (Colour), and a sentence ID; each sentence is matched to the book it comes from.
My dataset
Book ID| sentence ID| Colour | Sentences
1 | 1 | Blue | Text goes here
1 | 2 | Blue | Text goes here
1 | 3 | Blue | Text goes here
2 | 4 | Red | Text goes here
2 | 5 | Red | Text goes here
3 | 6 | Green | Text goes here
4 | 7 | Orange | Text goes here
4 | 8 | Orange | Text goes here
4 | 9 | Orange | Text goes here
4 | 10 | Orange | Text goes here
4 | 11 | Orange | Text goes here
5 | 12 | Blue | Text goes here
5 | 13 | Blue | Text goes here
6 | 14 | Red | Text goes here
6 | 15 | Red | Text goes here
I would like to take four randomized subsamples (each containing 25% of the original data) with the following conditions:
1) The distribution of book colours should remain the same as in the original dataset: if 10% of the books are blue, this should also be reflected in each subsample.
2) the subsample should not be taken/split by number of rows (which is the sentence ID) but by "Book ID". This means if Book ID 4 is sampled, then all sentences 7,8,9,10,11 should be in the sample dataset.
3) Also, each Book ID should only be in one of the 4 sub samples - this means if I decided to merge all 4 subsamples, I want to end up with the original dataset again.
What would be the best solution to split my dataset in the way described above?
Here is the short version:
library(tidyverse)
df <- tribble(
~Book_ID, ~sentence_ID, ~Colour, ~Sentences
,1 , 1, "Blue", "Text goes here"
,1 , 2, "Blue", "Text goes here"
,1 , 3, "Blue", "Text goes here"
,2 , 4, "Red", "Text goes here"
,2 , 5, "Red", "Text goes here"
,3 , 6, "Green", "Text goes here"
,4 , 7, "Orange", "Text goes here"
,4 , 8, "Orange", "Text goes here"
,4 , 9, "Orange", "Text goes here"
,4 , 10, "Orange", "Text goes here"
,4 , 11, "Orange", "Text goes here"
,5 , 12, "Blue", "Text goes here"
,5 , 13, "Blue", "Text goes here"
,6 , 14, "Red", "Text goes here"
,6 , 15, "Red", "Text goes here"
)
df %>%
  left_join(
    df %>%
      distinct(Book_ID, Colour) %>%
      group_by(Colour) %>%
      mutate(sub_sample = sample.int(4, size = n(), replace = TRUE)),
    by = c("Book_ID", "Colour")
  )
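This will give you something like the following (the sub_sample values are drawn at random, so your numbers will differ):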
# A tibble: 15 x 5
Book_ID sentence_ID Colour Sentences sub_sample
<dbl> <dbl> <chr> <chr> <int>
1 1 1 Blue "Text goes here" 3
2 1 2 Blue "Text goes here" 3
3 1 3 Blue "Text goes here" 3
4 2 4 Red "Text goes here" 1
5 2 5 Red "Text goes here" 1
6 3 6 Green "Text goes here" 1
7 4 7 Orange "Text goes here" 2
8 4 8 Orange "Text goes here" 2
9 4 9 Orange "Text goes here" 2
10 4 10 Orange "Text goes here" 2
11 4 11 Orange "Text goes here" 2
12 5 12 Blue "Text goes here" 2
13 5 13 Blue "Text goes here" 2
14 6 14 Red "Text goes here" 3
15 6 15 Red "Text goes here" 3
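If you want to check the result against the three conditions, something along these lines should work. This is only a sketch: `result` is a hand-typed stand-in for the joined output above (without the Sentences column), and the names `samples_per_book`, `colour_table`, `sub_samples` and `rebuilt` are just my choices for illustration.

```r
library(dplyr)

# Stand-in for the joined result shown above (Sentences column left out);
# the sub_sample values are copied from that printed output.
result <- data.frame(
  Book_ID     = c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4, 5, 5, 6, 6),
  sentence_ID = 1:15,
  Colour      = c("Blue", "Blue", "Blue", "Red", "Red", "Green",
                  "Orange", "Orange", "Orange", "Orange", "Orange",
                  "Blue", "Blue", "Red", "Red"),
  sub_sample  = c(3, 3, 3, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3),
  stringsAsFactors = FALSE
)

# Conditions 2 and 3: every book sits in exactly one subsample
samples_per_book <- result %>%
  group_by(Book_ID) %>%
  summarise(n_samples = n_distinct(sub_sample))
stopifnot(all(samples_per_book$n_samples == 1))

# Condition 1: number of books per colour in each subsample
colour_table <- result %>%
  distinct(Book_ID, Colour, sub_sample) %>%
  count(sub_sample, Colour)
print(colour_table)

# One data frame per subsample, as a named list
sub_samples <- split(result, result$sub_sample)

# Stacking the pieces again gives back all original rows
rebuilt <- bind_rows(sub_samples) %>% arrange(sentence_ID)
stopifnot(nrow(rebuilt) == nrow(result))
```

Note that in this particular draw no book received the number 4, so `split()` returns only three data frames. With this few books a purely random draw can leave a subsample empty.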
And a short explanation of the code:
Let's start with the nested part:
# take the dataframe
df %>%
  # ...and extract the distinct combinations of book and colour (one row per book)
  distinct(Book_ID, Colour) %>%
  # and now for each colour...
  group_by(Colour) %>%
  # ...assign every book a random subsample number from 1 to 4
  mutate(sub_sample = sample.int(4, size = n(), replace = TRUE))
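Grouping by colour means the subsample numbers are drawn separately within each colour, so the books of every colour are spread over the four subsamples. With enough books per colour, each subsample ends up with roughly the same colour distribution.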
The result of this is now left_join-ed to the original dataframe on the two columns we "distincted" before, which ensures that there can be no duplicates.
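One thing to keep in mind: sample.int() with replace = TRUE draws each book's number independently, so the subsamples have equal size and equal colour shares only on average. If you need them to be as even as possible, a possible variant is to shuffle the numbers 1 to 4 and deal them out within each colour. This is a sketch of that idea, not the code from above; the toy data frame is rebuilt without the Sentences column, and `book_groups` and `result` are names I made up.

```r
library(dplyr)

set.seed(42)  # only so that this sketch is reproducible

# Toy data with the same books and colours as the example
df <- data.frame(
  Book_ID     = c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4, 5, 5, 6, 6),
  sentence_ID = 1:15,
  Colour      = c("Blue", "Blue", "Blue", "Red", "Red", "Green",
                  "Orange", "Orange", "Orange", "Orange", "Orange",
                  "Blue", "Blue", "Red", "Red"),
  stringsAsFactors = FALSE
)

book_groups <- df %>%
  distinct(Book_ID, Colour) %>%
  group_by(Colour) %>%
  # rep_len() recycles a random ordering of 1..4 to the number of books of
  # this colour; indexing with sample.int(n()) then shuffles which book gets
  # which number. Within a colour, subsample sizes differ by at most one book.
  mutate(sub_sample = rep_len(sample.int(4), n())[sample.int(n())]) %>%
  ungroup()

# Attach the subsample number to every sentence of the book
result <- df %>%
  left_join(book_groups, by = c("Book_ID", "Colour"))
```

The indexing with sample.int(n()) instead of wrapping everything in sample() is deliberate: sample(x) treats a single number x as "1 to x", which would break for colours with only one book.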
One addition
To have the same colour distribution in the subsamples you of course need a sufficient number of books for each colour. With only 20 different green books, for example, a random draw will very likely leave the four subsamples with different shares of green. In that case you would probably want to "group" colours for the sampling. However, that's a statistics question and clearly beyond the scope of a programming forum.
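Purely on the programming side, "grouping" colours could look like the sketch below: colours with fewer than some minimum number of books are pooled into one stratum before the subsample numbers are assigned. The book table, the threshold `min_books = 4` and the stratum label "Other" are all assumptions for illustration; whether pooling is statistically sensible for your data is a separate question.

```r
library(dplyr)

set.seed(1)  # only so that this sketch is reproducible

# Hypothetical book-level table: one row per book
books <- data.frame(
  Book_ID = 1:16,
  Colour  = c(rep("Blue", 8), rep("Red", 5), rep("Orange", 2), "Green"),
  stringsAsFactors = FALSE
)

min_books <- 4  # assumed threshold: pool colours with fewer books than this

books <- books %>%
  # number of books per colour
  add_count(Colour, name = "n_books") %>%
  # rare colours share one sampling stratum, common colours keep their own
  mutate(stratum = if_else(n_books < min_books, "Other", Colour)) %>%
  group_by(stratum) %>%
  # deal the numbers 1..4 out evenly within each stratum, in random order
  mutate(sub_sample = rep_len(sample.int(4), n())[sample.int(n())]) %>%
  ungroup()
```

The resulting `books` table can then be joined back to the sentence-level data on Book_ID, in the same way as in the short version.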