Generating new subsets of data based on a sequence of values as a cut-off


I have a large dataset with pressure data. I want to create multiple subsets of it, each filtered at a different cutoff (e.g. > 3500, > 3600, etc.), and then run a couple of analyses on each subset.

So for instance, this might be similar to what I currently do:

library(dplyr)

#making a reproducible example
pressure <- runif(30, min = 3750, max = 4500)
value <- runif(30, min = 0, max = 50)
stage <- rep(c(1, 2), each = 15)

raw.data <- data.frame(pressure, value, stage)

#set a cutoff point
cutoff.press <- 3750

#make a new dataset
cutoff <- raw.data[raw.data$pressure > cutoff.press,]

#run an analysis
analysis <- cutoff %>%
  group_by(stage) %>%
  summarize(
    MinValue = min(value),
    MaxValue = max(value)
  )

Is there a way to do this without having to create multiple individual sets of data for each cut-off value of interest and then running each of the analyses individually?

For example, if I wanted to test multiple pressure cutoff values, such as seq(3750, 4000, 50), I don't want to repeat the process above for each value in the sequence.

I've thought about using dplyr's filter() function and setting a bunch of values by hand, but that would be time-consuming, and I'm not sure it would give me multiple datasets to run the analyses on.


Solution

  • If you need to run many iterations, purrr is a good option: it can filter the data at every cutoff and summarize each subset in one pipe.

    library(tidyverse)

    # Cutoff values to test.
    cutoffs <- seq(3750, 4000, 50)

    # One filtered copy of raw.data per cutoff, then the same summary on each.
    purrr::map(cutoffs, ~ dplyr::filter(raw.data, pressure > .x)) %>%
      purrr::map(. %>%
                   group_by(stage) %>%
                   summarize(MinValue = min(value),
                             MaxValue = max(value))) %>%
      # Name each result after its cutoff value.
      setNames(cutoffs)
    

    Output

    $`3750`
    # A tibble: 2 × 3
      stage MinValue MaxValue
      <dbl>    <dbl>    <dbl>
    1     1    3.52      46.6
    2     2    0.575     49.3
    
    $`3800`
    # A tibble: 2 × 3
      stage MinValue MaxValue
      <dbl>    <dbl>    <dbl>
    1     1    3.52      46.6
    2     2    0.575     47.5
    
    $`3850`
    # A tibble: 2 × 3
      stage MinValue MaxValue
      <dbl>    <dbl>    <dbl>
    1     1    3.52      46.6
    2     2    0.575     47.5
    
    $`3900`
    # A tibble: 2 × 3
      stage MinValue MaxValue
      <dbl>    <dbl>    <dbl>
    1     1    3.52      46.6
    2     2    0.575     47.5
    
    $`3950`
    # A tibble: 2 × 3
      stage MinValue MaxValue
      <dbl>    <dbl>    <dbl>
    1     1    3.52      46.6
    2     2    0.575     47.5
    
    $`4000`
    # A tibble: 2 × 3
      stage MinValue MaxValue
      <dbl>    <dbl>    <dbl>
    1     1    6.65      46.6
    2     2    0.575     47.5
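
If one table is easier to work with than a list of tibbles, the per-cutoff summaries can probably be stacked with dplyr's bind_rows(), which turns the list names into a column. The sketch below is only an illustration under assumptions: `toy` is a made-up stand-in for raw.data, and `cutoff_value`, `per_cutoff`, and `combined` are hypothetical names, not part of the original answer.

```r
library(dplyr)

# Hypothetical stand-in for raw.data, small enough to check by hand.
toy <- data.frame(
  pressure = c(3700, 3800, 3900, 4000, 3750, 3850, 3950, 4050),
  value    = c(10, 20, 30, 40, 5, 15, 25, 35),
  stage    = rep(c(1, 2), each = 4)
)

cutoffs <- c(3750, 3850, 3950)

# One summary per cutoff, stored in a list named after the cutoff values.
per_cutoff <- lapply(cutoffs, function(cutoff_value) {
  toy %>%
    filter(pressure > cutoff_value) %>%
    group_by(stage) %>%
    summarize(MinValue = min(value),
              MaxValue = max(value),
              .groups = "drop")
})
names(per_cutoff) <- cutoffs

# Stack the list into one data frame; the list names become the cutoff column.
combined <- bind_rows(per_cutoff, .id = "cutoff") %>%
  mutate(cutoff = as.numeric(cutoff))

print(combined)
```

With this toy data, `combined` has six rows (three cutoffs times two stages), which makes it straightforward to plot or compare the summaries across cutoffs in a single step.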
    

    Data

    raw.data <- structure(list(pressure = c(4160.41269886773, 4044.58961030468, 
                                4336.48418885423, 3762.11064029485, 4235.55055609904, 3926.50744639104, 
                                4086.0048676841, 4360.64667999744, 3850.74476944283, 3950.07681293646, 
                                4347.61320002144, 3996.32209626725, 4262.53829378402, 3869.30528597441, 
                                4252.7681372012, 4013.94325762521, 4275.64664371312, 4197.37908616662, 
                                4231.71574808657, 4028.1643497292, 4407.9091984313, 4481.91399103962, 
                                4353.40271308087, 4013.09538848, 4109.39885408152, 4195.05179609405, 
                                4222.33691916335, 4316.15335500101, 3860.02388742054, 3772.72424055263
    ), value = c(46.6360261081718, 19.0778955002315, 9.46381011744961, 
                 17.4791521392763, 6.64818733930588, 3.79822270479053, 17.0007253182121, 
                 45.9705576649867, 39.6164933103137, 3.52405618177727, 29.9587145447731, 
                 10.8624027809128, 45.8421137067489, 34.4845326268114, 17.0537169324234, 
                 47.0035993610509, 29.5542735257186, 12.992845242843, 32.0275551988743, 
                 21.112488291692, 12.7272683312185, 23.9938693121076, 18.5264392290264, 
                 42.9235454765148, 0.575024168938398, 10.7687710318714, 0.992469629272819, 
                 47.4592371145263, 40.4172958689742, 49.3020136258565), 
    stage = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 
              2, 2, 2, 2, 2, 2, 2, 2)), class = "data.frame", row.names = c(NA, -30L))
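
For readers who would rather not load the tidyverse, the same idea can be sketched in base R with lapply() and tapply(). This is a hedged alternative, not the method from the answer above: `toy` is made-up data standing in for raw.data, and `summarise_above` is a hypothetical helper name.

```r
# Hypothetical stand-in for raw.data, small enough to check by hand.
toy <- data.frame(
  pressure = c(3700, 3800, 3900, 4000, 3750, 3850, 3950, 4050),
  value    = c(10, 20, 30, 40, 5, 15, 25, 35),
  stage    = rep(c(1, 2), each = 4)
)

cutoffs <- seq(3750, 3950, 100)

# Hypothetical helper: keep rows above the cutoff, then take the
# minimum and maximum of value within each stage.
summarise_above <- function(data, cutoff_value) {
  kept <- data[data$pressure > cutoff_value, ]
  data.frame(
    stage    = sort(unique(kept$stage)),
    MinValue = as.vector(tapply(kept$value, kept$stage, min)),
    MaxValue = as.vector(tapply(kept$value, kept$stage, max))
  )
}

# Apply the helper to every cutoff and name the results after the cutoffs.
results <- lapply(cutoffs, function(cutoff_value) summarise_above(toy, cutoff_value))
names(results) <- cutoffs

print(results[["3850"]])
```

Each element of `results` is a plain data frame, so an individual cutoff can be pulled out by name, as in `results[["3850"]]`.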