Get index of density bin values of histogram in hist() R


I would like to get the index values of the bins in a histogram generated via hist().

Example and details follow:

testhist <- hist(rnorm(1000, 1000, 100), n = 5000, xlim = c(0,5000), probability = TRUE)

This gives testhist$density, which holds my 'y' values. In the code I set n = 5000, intending 5000 bins across x from 0 to 5000. I would like to get the index of the histogram bin that each 'y' value corresponds to.

i.e:

Bin Index  |  'y' value
1          |  0
1          |  0.000005
1          |  0
1          |  0
1          |  0.0000001
2          |  0.00002
3          |  0
3          |  0.0002
...5000

Any assistance is appreciated.

EDIT: as commenters pointed out, n= is only an approximation. So, let's do this:

testhist <- hist(rnorm(1000, 1000, 100), breaks = seq(0,5000, by = 5), xlim = c(0,5000), probability = TRUE)

Now there are exactly 1000 bins. How do I get the index of the bin corresponding to a 'y' value? That is, which y values fall in bin 1, which covers the range 0 to 5?

EDIT 2: Each bin corresponds to a single density; the more bins there are, the more representative of the data the histogram would be. Thanks for steering me in the right direction.


Solution

  • There's a bit of confusion about what hist does or doesn't do here.

    1. hist() documents no n= argument, only breaks=. The call most likely still works because R's partial argument matching resolves n= to the nclass= argument of hist.default, which ?hist describes as equivalent to breaks= for a single number.
    2. Setting breaks=5000 does not guarantee 5000 bins, as @Onyambu notes, because the break points are passed through pretty(). From ?hist: "...the number is a suggestion only; as the breakpoints will be set to pretty values."
    3. testhist$density holds one density value per bin, so the bin index of a density is simply its position in that vector. You can verify this with:

    set.seed(1)
    x <- rnorm(1000, 1000, 100)
    testhist <- hist(x, n=5000, xlim = c(0,5000), probability = TRUE)
    length(testhist$mids)
    #[1] 6820
    length(testhist$density)
    #[1] 6820
    length(testhist$breaks)
    #[1] 6821
    

    That is 6820 bin midpoints, 6820 corresponding densities, and 6821 breaks, since k bins require k+1 breaks.
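
    To get the "bin index | y value" table the question asks for, the components of the returned object can be lined up in a data frame. This is only a minimal sketch, assuming the exact 5-unit breaks from the question's edit; the names h, bins and the column names are illustrative, and plot = FALSE is used merely to skip drawing.

    ```r
    set.seed(1)
    x <- rnorm(1000, 1000, 100)

    # Exact breaks: 1001 break points give exactly 1000 bins of width 5
    h <- hist(x, breaks = seq(0, 5000, by = 5), plot = FALSE)

    # One row per bin: index, lower and upper edge, count and density
    bins <- data.frame(
      index   = seq_along(h$density),
      lower   = head(h$breaks, -1),
      upper   = tail(h$breaks, -1),
      count   = h$counts,
      density = h$density
    )

    # Show only the first few bins that actually contain data
    head(bins[bins$count > 0, ])

    # Density is count / (total count * bin width), so the areas sum to 1
    all.equal(h$density, h$counts / (sum(h$counts) * diff(h$breaks)))
    sum(h$density * diff(h$breaks))
    ```

    With explicit breaks the row number of bins is the bin index, so bins$density[1] is the density for the range 0 to 5.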

    The original 1000 data-points are represented in these 6820 bins, with many of the counts and corresponding densities being zero.

    sum(testhist$counts)
    #[1] 1000
    sum(testhist$counts == 0)
    #[1] 5954
    sum(testhist$density == 0)
    #[1] 5954
    

    If you want to know which bin each original value of x falls into, you can do:

    cut(x, testhist$breaks, labels=FALSE)
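
    One caveat with the cut() call: hist() defaults to include.lowest = TRUE while cut() defaults to include.lowest = FALSE, so a value exactly equal to the first break would come back as NA unless that argument is set. The sketch below, again assuming the exact 5-unit breaks from the question's edit, assigns each observation its bin index and looks up the density of that bin; h, bin_of_x and density_of_x are illustrative names.

    ```r
    set.seed(1)
    x <- rnorm(1000, 1000, 100)
    h <- hist(x, breaks = seq(0, 5000, by = 5), plot = FALSE)

    # Bin index of each observation; include.lowest = TRUE mirrors hist()'s default
    bin_of_x <- cut(x, h$breaks, labels = FALSE, include.lowest = TRUE)

    # Counting observations per bin index should reproduce h$counts
    all(tabulate(bin_of_x, nbins = length(h$counts)) == h$counts)

    # Density of the bin that each observation falls into
    density_of_x <- h$density[bin_of_x]
    head(data.frame(x = x, bin = bin_of_x, density = density_of_x))
    ```

    Going the other way, which(bin_of_x == 200) would list the positions in x of the observations that landed in bin 200.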