Search code examples
rdplyrsampling

Create numeric samples based on multiple conditions of multiple vectors


Given the following data frame:

df <- tibble::tribble(
  ~pass_id, ~km_ini, ~km_fin,
        1L,    0.89,    2.39,
        2L,    1.53,    3.03,
        3L,    21.9,    23.4,
        4L,    23.4,    24.9,
        5L,      24,    25.5,
        6L,    25.9,    27.4,
        7L,    36.7,    38.2,
        8L,    41.4,    42.9,
        9L,    42.1,    43.6,
       10L,    45.5,      47
  )

Created on 2020-02-17 by the reprex package (v0.3.0)

I need a sample of any 50 numbers that match the all following criteria for all the rows of df:

  1. >= .750
  2. <= 99.450
  3. < km_ini - .750
  4. > km_fin + .750

My best shot is far away from what I am expecting. First I did a runif, then I enframed it, and tried to filter, but I just got it working for the first two conditions. Anyway, I don't necessarily need the result as a data frame, it could be a numeric vector.

library(tidyverse, verbose = F)

set.seed(42)
sort(runif(100000, 0, 99.450)) %>% 
  enframe(., "ID", "km") %>% 
  filter(km >= .750 & km <= 99.450 - .750)
#> # A tibble: 98,467 x 2
#>       ID    km
#>    <int> <dbl>
#>  1   763 0.750
#>  2   764 0.751
#>  3   765 0.751
#>  4   766 0.753
#>  5   767 0.753
#>  6   768 0.754
#>  7   769 0.754
#>  8   770 0.755
#>  9   771 0.755
#> 10   772 0.757
#> # … with 98,457 more rows

EDIT: trying to visually display the problem

The final result needs to be a numerical vector (or df) that assesses the entire data set, not just each row separately. As an example for the first two lines, see the following representation:

enter image description here

So, see that:

  • The black line indicates that I cannot have data smaller than .750.
  • The blue line indicates where I cannot have records due to the coverage area of km_ini and km_fin (arrows) of line 1 and one more appendix considering the area of + or - .750 (between the arrows and the points).
  • The red line indicates where I cannot have records due to the coverage area of km_ini and km_fin (arrows) of line 2 and another appendix considering the area of + or - .750 (between the arrows and the points).

This way, right away, the random set of data within the first 4000 meters, could only have numbers from 3030 + .750.

The question, then, is to try to do this programmatically so that all lines of the data frame are evaluated and the numbers generated are not within all the conditions mentioned.


Solution

  • I think I understand. You want to sample in the gaps delimited by the distances, with the complicating factor that you cannot sample on either side of the marked distances for 750m.

    I think it is useful to get a clearer visual understanding of the problem. In this plot, the x axis represents distance (the y axis is just a "dummy" axis, since we are only interested in the x axis). The black bars are the "exclusion zones" in which we cannot sample. There are also 750m zones on either side of the exclusion zones in which we do not want to sample, which are here coloured red:

    enter image description here

    So essentially, we want a uniform sample from the non-shaded areas of the x axis in this plot.

    My solution is to first merge the overlapping segments, then create a sample space that is weighted according to the size of each gap and take 50 uniform samples from that space.

    Here, I have generalized to allow arbitrary limits and sample size.

    sample_space <- function(km_ini, km_fin, km_max = 99.45, buffer = 0.75, n = 50)
    {
      # Find and merge overlaps
      i <- 1
      km_ini <- km_ini - buffer
      km_fin <- km_fin + buffer
      while(i <= length(km_ini))
      {
        overlaps <- which(km_ini < km_fin[i] & km_fin > km_ini[i])
        if(length(overlaps) < 2) {i <- i + 1; next;}
        km_ini <- c(km_ini, min(km_ini[overlaps]))
        km_fin <- c(km_fin, max(km_fin[overlaps]))
        km_ini <- km_ini[-overlaps]
        km_fin <- km_fin[-overlaps]
        i <- 1
      }
    
      # Create a matrix of appropriate gaps
      gaps <- cbind(sort(km_fin), c(sort(km_ini)[-1], km_max))
    
      print(gaps)
      # Create a weighted sample space
      splits <- c(0, cumsum(apply(gaps, 1, diff)))
    
      # Take a sample within that space
      runifs <- runif(n, 0, max(splits))
    
      # Add the appropriate starting value back on
      index <- as.numeric(cut(runifs, splits))
      runifs - splits[index] + gaps[index, 1]
    }
    

    So now we can do

    sample_space(df$km_ini, df$km_fin)
    #>  [1] 93.107858 92.216660 83.597703 86.341198 72.258245 86.591883 18.572744
    #>  [8] 16.641163 73.344658 73.075426 78.230074 97.745802 52.654342 52.298444
    #> [15] 70.029034 67.430346 95.328900 62.250864 79.144025 86.344868  7.063474
    #> [22] 58.797335 79.304272 54.731057 32.137068 84.837576 94.226992 50.808135
    #> [29] 65.987277 76.666933 29.225744 33.309866 13.013735  6.925277 65.207383
    #> [36] 91.591293 61.614993 18.646062 97.550237 48.478875 12.860920 20.263471
    #> [43] 34.980616 50.583291 15.813562 96.104448 91.310377 53.063613 17.376281
    #> [50] 72.511153
    

    To show that this does what we wanted it to, let's plot the sample over the plot of exclusion zones:

    set.seed(69)
    sample_df <- data.frame(x = sample_space(df$km_ini, df$km_fin),
                          y = runif(50, 0.45, 0.55))
    
    ggplot(df) + 
      geom_rect(aes(xmin = km_ini - 0.75, xmax = km_fin + 0.75, ymin = 0, ymax = 1), 
                alpha = 0.5, fill = "red") +
      geom_rect(aes(xmin = km_ini, xmax = km_fin, ymin = 0, ymax = 1), fill = "black") +
      geom_rect(aes(xmin = 0, xmax = 0.75, ymin = 0, ymax = 1), alpha = 0.5) +
      geom_rect(aes(xmin = 99.45, xmax = 100, ymin = 0, ymax = 1), alpha = 0.5) +
      labs(x = "distance", y = "dummy") +
      geom_point(data = sample_df, aes(x = x, y = y), col = "blue")
    

    enter image description here

    Created on 2020-03-01 by the reprex package (v0.3.0)