Finding a value in an interval


Sorry if this is a basic question. I have been trying to figure this out without success. I have a vector of values called sym.

> head(sym)
         [,1]
val  3.652166e-05
val -2.094026e-05
val  4.583950e-05
val  6.570184e-06
val -1.431486e-05
val -5.339604e-06

I put these into intervals by calling factor() on the result of cut() applied to sym.

factorx<-factor(cut(sym,breaks=nclass.Sturges(sym)))

 [1] (2.82e-05,5.28e-05]  (-2.11e-05,3.55e-06] (2.82e-05,5.28e-05]  (3.55e-06,2.82e-05]    (-2.11e-05,3.55e-06] (-2.11e-05,3.55e-06] 
[7] (-2.11e-05,3.55e-06] (2.82e-05,5.28e-05]  (3.55e-06,2.82e-05]  (7.74e-05,0.000102] 

Levels: (-2.11e-05,3.55e-06] (3.55e-06,2.82e-05] (2.82e-05,5.28e-05] (7.74e-05,0.000102]

So clearly, four intervals were created in factorx. Now I have a new value tmp=3.7e-06. My question is: how can I find which of the above intervals it belongs to? I tried to use findInterval(), but it does not seem to work on factors like factorx.

Thanks


Solution

  • If you plan to re-classify new values, it's best to explicitly set the breaks= parameter with a vector of break points rather than a number of intervals. Note that had those values been in the set originally, you might have gotten different breaks, and your new values may fall outside all the levels of your existing data, which can be troublesome.

    So first, I will generate some sample data.

    set.seed(18)
    x <- runif(50)
    

    Now I will show two different ways to calculate breaks. Here are b1() and b2():

    b1<-function(x, n=nclass.Sturges(x)) {
        #like default cut()
        nb <- as.integer(n + 1)
        dx <- diff(rx <- range(x, na.rm = TRUE))
        if (dx == 0) 
            dx <- abs(rx[1L])
        seq.int(rx[1L] - dx/1000, rx[2L] + dx/1000, 
            length.out = nb)
    }
    b2<-function(x, n=nclass.Sturges(x)) {
        #like default hist()
        pretty(range(x), n=n)
    }
    

    So each of these functions will give break points similar to either the default behaviors of cut() or hist(). Rather than just a single number of breaks, they each return a vector with all the break points explicitly stated. This allows you to use cut() to create your factor

    mybreaks <- b1(x)
    factorx <- cut(x, breaks=mybreaks)
    

    (Note that you don't have to wrap cut() in factor(), as cut() already returns a factor; wrapping it would only drop unused levels.) Now, if you get new values, you can classify them using findInterval() and the breaks vector you've already prepared:

    nv <- runif(5)
    grp <- findInterval(nv,mybreaks)
    

    And we can check the results with

    data.frame(grp=levels(factorx)[grp], x=nv)
    #              grp         x
    # 1  (0.831,0.969] 0.8769438
    # 2 (0.00131,0.14] 0.1188054
    # 3  (0.416,0.554] 0.5467373
    # 4   (0.14,0.278] 0.2327532
    # 5  (0.554,0.693] 0.6022678
    

    and everything looks pretty good. In this case, findInterval() tells you which level of the previously created factor each item belongs to. It returns 0 for a value below the first break, and length(mybreaks), an index one past the last level, for a value at or above the last break. This differs from cut(), which returns NA for values outside the breaks. The intervals also differ at their edges: cut() intervals are right-closed by default, while findInterval() intervals are left-closed and right-open by default.
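
    The edge behavior of the two functions can be checked on a small example. This is only a sketch: brks and vals are hypothetical values chosen to hit the boundaries, and the left.open= argument of findInterval() assumes a reasonably recent version of R (3.3.0 or later).

    ```r
    # Hypothetical breaks and test values, chosen only for illustration
    brks <- c(0, 1, 2, 3)
    vals <- c(-0.5, 1, 2.5, 3, 4)

    # cut(): right-closed intervals (a,b]; values outside the breaks give NA
    cut_idx <- as.integer(cut(vals, breaks = brks))
    # NA 1 3 3 NA

    # findInterval(): left-closed intervals [a,b) by default;
    # 0 below the first break, length(brks) at or above the last break
    fi_default <- findInterval(vals, brks)
    # 0 2 3 4 4

    # left.open = TRUE switches to (a,b] intervals, as in cut()
    fi_open <- findInterval(vals, brks, left.open = TRUE)
    # 0 1 3 3 4

    # Map the two out-of-range codes to NA so the result lines up with cut()
    fi_open[fi_open == 0L | fi_open == length(brks)] <- NA
    # NA 1 3 3 NA
    ```

    With left.open = TRUE and the out-of-range codes set to NA, the indices agree with cut() on these values, so levels(factorx)[grp] style lookups would not silently mislabel a value that sits exactly on a break.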