Search code examples
rintervals

findInterval() in R behaves unexpectedly


I think I found a weird behavior in R's function base::findInterval().

I have a long vector p of cumulative values in the range (0,1]. For the sake of simplicity, we could define p as

ones <- rep(1, 2000)
p <- cumsum(ones)/sum(ones)
head(p)
#> [1] 0.0005 0.0010 0.0015 0.0020 0.0025 0.0030
tail(p)
#> [1] 0.9975 0.9980 0.9985 0.9990 0.9995 1.0000

Created on 2023-09-28 with reprex v2.0.2

Problem

I need to find the indexes of p of those values for which the interval changes based on vector probs of breakpoints: probs <- seq(1/nbins,1, 1/nbins), where nbins (100, in my case) is the number of breaks.

I am using findInterval() for this task and I found that for some sections of p it correctly identifies the intervals, but for others, it does not.

Example

In the example below, I created vectors, x, y, and z, which are subsets of p but are sufficient to show the problem. Notice that findInterval() works great with x and y, but it misidentifies the interval in z. Does any of you know why? Am I doing something wrong?

# Number of bins
nbins <- 100

# Cuts in the cumulative distribution
probs  <- seq(1/nbins,1, 1/nbins)

# subsets of the large cumulative distribution
x <- c(0.0095, 0.01, 0.0105, 0.011)
y <- c(0.0495, 0.05, 0.0505, 0.051)
z <- c(0.0595, 0.06, 0.0605, 0.061)

# These two work fine
findInterval(x,probs)
#> [1] 0 1 1 1
findInterval(y,probs)
#> [1] 4 5 5 5

# this should be c(5, 6, 6, 6) NOT c(5, 5, 6, 6)
findInterval(z,probs)
#> [1] 5 5 6 6

Created on 2023-09-28 with reprex v2.0.2

Implication

Notice that in this example, p is a straightforward vector and we know the actual indexes beforehand. But in a real case scenario, we won't know if findInterval() is working correctly.

side note

As a side note, I am identifying the indexes with the code below. I think it should be working fine if findInterval() works correctly. Do you know if there is any other, super-efficient way to find the indexes?

# these are the indexes
ind <-
   findInterval(p, probs) |>
    diff() |> {
    \(.) which(. == 1)
     }() + 1

Solution

  • I think this is an example of the floating point problem in R. Below you can see that even though probs[6] prints as 0.06 it is not equivalent to that number:

    nbins <- 100
    
    # Cuts in the cumulative distribution
    probs  <- seq(1/nbins,1, 1/nbins)
    
    probs[6]
    #> [1] 0.06
    probs[6] == .06
    #> [1] FALSE
    

    If you round the probabilities in probs even to 15 decimal places, you get the expected result.

    # subsets of the large cumulative distribution
    x <- c(0.0095, 0.01, 0.0105, 0.011)
    y <- c(0.0495, 0.05, 0.0505, 0.051)
    z <- c(0.0595, 0.06, 0.0605, 0.061)
    
    findInterval(z, probs)
    #> [1] 5 5 6 6
    findInterval(z, round(probs, 15))
    #> [1] 5 6 6 6
    

    Created on 2023-09-28 with reprex v2.0.2