Given the following data frame:
df <- tibble::tribble(
~pass_id, ~km_ini, ~km_fin,
1L, 0.89, 2.39,
2L, 1.53, 3.03,
3L, 21.9, 23.4,
4L, 23.4, 24.9,
5L, 24, 25.5,
6L, 25.9, 27.4,
7L, 36.7, 38.2,
8L, 41.4, 42.9,
9L, 42.1, 43.6,
10L, 45.5, 47
)
Created on 2020-02-17 by the reprex package (v0.3.0)
I need a sample of any 50 numbers that match the all following criteria for all the rows of df
:
>= .750
<= 99.450
< km_ini - .750
> km_fin + .750
My best shot is far away from what I am expecting. First I did a runif
, then I enframe
d it, and tried to filter
, but I just got it working for the first two conditions. Anyway, I don't necessarily need the result as a data frame, it could be a numeric vector.
library(tidyverse, verbose = F)
set.seed(42)
sort(runif(100000, 0, 99.450)) %>%
enframe(., "ID", "km") %>%
filter(km >= .750 & km <= 99.450 - .750)
#> # A tibble: 98,467 x 2
#> ID km
#> <int> <dbl>
#> 1 763 0.750
#> 2 764 0.751
#> 3 765 0.751
#> 4 766 0.753
#> 5 767 0.753
#> 6 768 0.754
#> 7 769 0.754
#> 8 770 0.755
#> 9 771 0.755
#> 10 772 0.757
#> # … with 98,457 more rows
EDIT: trying to visually display the problem
The final result needs to be a numerical vector (or df) that assesses the entire data set, not just each row separately. As an example for the first two lines, see the following representation:
So, see that:
This way, right away, the random set of data within the first 4000 meters, could only have numbers from 3030 + .750.
The question, then, is to try to do this programmatically so that all lines of the data frame are evaluated and the numbers generated are not within all the conditions mentioned.
I think I understand. You want to sample in the gaps delimited by the distances, with the complicating factor that you cannot sample on either side of the marked distances for 750m.
I think it is useful to get a clearer visual understanding of the problem. In this plot, the x axis represents distance (the y axis is just a "dummy" axis, since we are only interested in the x axis). The black bars are the "exclusion zones" in which we cannot sample. There are also 750m zones on either side of the exclusion zones in which we do not want to sample, which are here coloured red:
So essentially, we want a uniform sample from the non-shaded areas of the x axis in this plot.
My solution is to first merge the overlapping segments, then create a sample space that is weighted according to the size of each gap and take 50 uniform samples from that space.
Here, I have generalized to allow arbitrary limits and sample size.
sample_space <- function(km_ini, km_fin, km_max = 99.45, buffer = 0.75, n = 50)
{
# Find and merge overlaps
i <- 1
km_ini <- km_ini - buffer
km_fin <- km_fin + buffer
while(i <= length(km_ini))
{
overlaps <- which(km_ini < km_fin[i] & km_fin > km_ini[i])
if(length(overlaps) < 2) {i <- i + 1; next;}
km_ini <- c(km_ini, min(km_ini[overlaps]))
km_fin <- c(km_fin, max(km_fin[overlaps]))
km_ini <- km_ini[-overlaps]
km_fin <- km_fin[-overlaps]
i <- 1
}
# Create a matrix of appropriate gaps
gaps <- cbind(sort(km_fin), c(sort(km_ini)[-1], km_max))
print(gaps)
# Create a weighted sample space
splits <- c(0, cumsum(apply(gaps, 1, diff)))
# Take a sample within that space
runifs <- runif(n, 0, max(splits))
# Add the appropriate starting value back on
index <- as.numeric(cut(runifs, splits))
runifs - splits[index] + gaps[index, 1]
}
So now we can do
sample_space(df$km_ini, df$km_fin)
#> [1] 93.107858 92.216660 83.597703 86.341198 72.258245 86.591883 18.572744
#> [8] 16.641163 73.344658 73.075426 78.230074 97.745802 52.654342 52.298444
#> [15] 70.029034 67.430346 95.328900 62.250864 79.144025 86.344868 7.063474
#> [22] 58.797335 79.304272 54.731057 32.137068 84.837576 94.226992 50.808135
#> [29] 65.987277 76.666933 29.225744 33.309866 13.013735 6.925277 65.207383
#> [36] 91.591293 61.614993 18.646062 97.550237 48.478875 12.860920 20.263471
#> [43] 34.980616 50.583291 15.813562 96.104448 91.310377 53.063613 17.376281
#> [50] 72.511153
To show that this does what we wanted it to, let's plot the sample over the plot of exclusion zones:
set.seed(69)
sample_df <- data.frame(x = sample_space(df$km_ini, df$km_fin),
y = runif(50, 0.45, 0.55))
ggplot(df) +
geom_rect(aes(xmin = km_ini - 0.75, xmax = km_fin + 0.75, ymin = 0, ymax = 1),
alpha = 0.5, fill = "red") +
geom_rect(aes(xmin = km_ini, xmax = km_fin, ymin = 0, ymax = 1), fill = "black") +
geom_rect(aes(xmin = 0, xmax = 0.75, ymin = 0, ymax = 1), alpha = 0.5) +
geom_rect(aes(xmin = 99.45, xmax = 100, ymin = 0, ymax = 1), alpha = 0.5) +
labs(x = "distance", y = "dummy") +
geom_point(data = sample_df, aes(x = x, y = y), col = "blue")
Created on 2020-03-01 by the reprex package (v0.3.0)