I think I found a weird behavior in R's function base::findInterval()
.
I have a long vector p
of cumulative values in the range (0,1]. For the sake of simplicity, we could define p
as
ones <- rep(1, 2000)
p <- cumsum(ones)/sum(ones)
head(p)
#> [1] 0.0005 0.0010 0.0015 0.0020 0.0025 0.0030
tail(p)
#> [1] 0.9975 0.9980 0.9985 0.9990 0.9995 1.0000
Created on 2023-09-28 with reprex v2.0.2
I need to find the indexes of p
of those values for which the interval changes based on vector probs
of breakpoints: probs <- seq(1/nbins,1, 1/nbins)
, where nbins
(100, in my case) is the number of breaks.
I am using findInterval()
for this task and I found that for some sections of p
it correctly identifies the intervals, but for others, it does not.
In the example below, I created vectors, x
, y
, and z
, which are subsets of p
but are sufficient to show the problem. Notice that findInterval()
works great with x
and y
, but it misidentifies the interval in z
. Does any of you know why? Am I doing something wrong?
# Number of bins
nbins <- 100
# Cuts in the cumulative distribution
probs <- seq(1/nbins,1, 1/nbins)
# subsets of the large cumulative distribution
x <- c(0.0095, 0.01, 0.0105, 0.011)
y <- c(0.0495, 0.05, 0.0505, 0.051)
z <- c(0.0595, 0.06, 0.0605, 0.061)
# These two work fine
findInterval(x,probs)
#> [1] 0 1 1 1
findInterval(y,probs)
#> [1] 4 5 5 5
# this should be c(5, 6, 6, 6) NOT c(5, 5, 6, 6)
findInterval(z,probs)
#> [1] 5 5 6 6
Created on 2023-09-28 with reprex v2.0.2
Notice that in this example, p
is a straightforward vector and we know the actual indexes beforehand. But in a real case scenario, we won't know if findInterval()
is working correctly.
As a side note, I am identifying the indexes with the code below. I think it should be working fine if findInterval()
works correctly. Do you know if there is any other, super-efficient way to find the indexes?
# these are the indexes
ind <-
findInterval(p, probs) |>
diff() |> {
\(.) which(. == 1)
}() + 1
I think this is an example of the floating point problem in R. Below you can see that even though probs[6]
prints as 0.06
it is not equivalent to that number:
nbins <- 100
# Cuts in the cumulative distribution
probs <- seq(1/nbins,1, 1/nbins)
probs[6]
#> [1] 0.06
probs[6] == .06
#> [1] FALSE
If you round the probabilities in probs
even to 15 decimal places, you get the expected result.
# subsets of the large cumulative distribution
x <- c(0.0095, 0.01, 0.0105, 0.011)
y <- c(0.0495, 0.05, 0.0505, 0.051)
z <- c(0.0595, 0.06, 0.0605, 0.061)
findInterval(z, probs)
#> [1] 5 5 6 6
findInterval(z, round(probs, 15))
#> [1] 5 6 6 6
Created on 2023-09-28 with reprex v2.0.2