How to proportionally split data using initial_split in R


I would like to proportionally split the data I have. For example, I have 100 rows and I want to randomly sample 1 row out of every two rows. Using rsample from tidymodels, I assumed I would do the following.

dat <- as_tibble(seq(1:100))

split <- initial_split(dat, prop = 0.5, breaks = 50)

testing <- testing(split)

When I check the data, the split hasn't done what I thought it would. It seems close but not exact. I thought the breaks argument generates bins which are sampled from, so breaks = 50 would split the 100 rows into 50 bins with two rows per bin. I have also tried strata = value to stratify across the rows, but I cannot get this to work either.

I am using this as an example, but I am also curious how this would work when sampling 1 row out of every four, etc.

Have I misunderstood the breaks argument?


Solution

  • You are running up against an argument that protects users from creating stratified splits that are too small; it's called pool:

    library(rsample)
    library(dplyr)
    #> 
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union
    
    dat <- tibble(value = seq(1:100), strat = as.factor(rep(1:50, each = 2))) 
    dat
    #> # A tibble: 100 × 2
    #>    value strat
    #>    <int> <fct>
    #>  1     1 1    
    #>  2     2 1    
    #>  3     3 2    
    #>  4     4 2    
    #>  5     5 3    
    #>  6     6 3    
    #>  7     7 4    
    #>  8     8 4    
    #>  9     9 5    
    #> 10    10 5    
    #> # … with 90 more rows
    
    split <- initial_split(dat, prop = 0.5, strata = strat, pool = 0.0)
    #> Warning: Stratifying groups that make up 0% of the data may be statistically risky.
    #> • Consider increasing `pool` to at least 0.1
    split
    #> <Analysis/Assess/Total>
    #> <50/50/100>
    
    training(split) %>% arrange(strat)
    #> # A tibble: 50 × 2
    #>    value strat
    #>    <int> <fct>
    #>  1     1 1    
    #>  2     4 2    
    #>  3     5 3    
    #>  4     8 4    
    #>  5    10 5    
    #>  6    12 6    
    #>  7    13 7    
    #>  8    16 8    
    #>  9    17 9    
    #> 10    20 10   
    #> # … with 40 more rows
    testing(split) %>% arrange(strat)
    #> # A tibble: 50 × 2
    #>    value strat
    #>    <int> <fct>
    #>  1     2 1    
    #>  2     3 2    
    #>  3     6 3    
    #>  4     7 4    
    #>  5     9 5    
    #>  6    11 6    
    #>  7    14 7    
    #>  8    15 8    
    #>  9    18 9    
    #> 10    19 10   
    #> # … with 40 more rows
    

    Created on 2022-02-22 by the reprex package (v2.0.1)

    We really don't recommend turning pool down to zero like this, but you can do it here to see how the strata and prop arguments work.
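
    The question also asks about sampling 1 row out of every four. A minimal sketch of the same approach, assuming groups of four consecutive rows and prop = 0.25, might look like the following. The names dat4, split4 and sampled are made up for illustration, and the seed is arbitrary.

    ```r
    library(rsample)
    library(dplyr)

    # Arbitrary seed, only so the random draw is reproducible
    set.seed(123)

    # Hypothetical variation on the data above: 25 strata of four consecutive rows
    dat4 <- tibble(value = 1:100, strat = as.factor(rep(1:25, each = 4)))

    # prop = 0.25 asks for one row in four; pool = 0 stops the small strata
    # from being pooled together (expect the same warning as above)
    split4 <- initial_split(dat4, prop = 0.25, strata = strat, pool = 0)

    # The training set holds the sampled rows; count rows per stratum to check
    sampled <- training(split4)
    nrow(sampled)
    count(sampled, strat)
    ```

    If this behaves like the two-row case, each stratum should contribute one row to training(split4) and the other three to testing(split4). The same caution about setting pool to zero applies.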