Search code examples
rplotdata-visualizationhistogramdata-manipulation

R: Find out which observations are located in each "bar" of the histogram


I am working with the R programming language. Suppose I have the following data:

     a = rnorm(1000,10,1)
     b = rnorm(200,3,1)
     c = rnorm(200,13,1)
    
    d = c(a,b,c)
index <- 1:1400

my_data = data.frame(index,d)

I can make the following histograms of the same data by adjusting the "bin" length (via the "breaks" option):

hist(my_data, breaks = 10, main = "Histogram #1, Breaks = 10")
 hist(my_data, breaks = 100, main = "Histogram #2, Breaks = 100")
 hist(my_data, breaks = 5, main = "Histogram #3, Breaks = 5")

enter image description here

My Question: In each one of these histograms there are a different number of "bars" (i.e. bins). For example, in the first histogram there are 8 bars and in the third histogram there are 4 bars. For each one of these histograms, is there a way to find out which observations (from the original file "d") are located in each bar?

Right now, I am trying to manually do this, e.g. (for histogram #3)

histogram3_bar1 <- my_data[which(my_data$d < 5 & my_data$d > 0), ]
histogram3_bar2 <- my_data[which(my_data$d < 10 & my_data$d > 5), ]
histogram3_bar3 <- my_data[which(my_data$d < 15 & my_data$d > 10), ]
histogram3_bar4 <- my_data[which(my_data$d < 15 & my_data$d > 20), ]


head(histogram3_bar1)

    index        d
1001  1001 4.156393
1002  1002 3.358958
1003  1003 1.605904
1004  1004 3.603535
1006  1006 2.943456
1007  1007 1.586542

But is there a more "efficient" way to do this?

Thanks!


Solution

  • hist itself can provide for the solution to the question's problem, to find out which data points are in which intervals. hist returns a list with first member breaks

    First, make the problem reproducible by setting the RNG seed.

    set.seed(2021)
    a = rnorm(1000,10,1)
    b = rnorm(200,3,1)
    c = rnorm(200,13,1)
    d = c(a,b,c)
    

    Now, save the return value of hist and have findInterval tell the bins where each data points are in.

    h1 <- hist(d, breaks = 10)
    f1 <- findInterval(d, h1$breaks)
    
    h1$breaks
    # [1] -2  0  2  4  6  8 10 12 14 16
    
    head(f1)
    #[1] 6 7 7 7 7 6
    

    The first six observations are intervals 6 and 7 with end points 8, 10 and 12, as can be seen indexing d by f1:

    head(d[f1])
    #[1]  8.07743 10.26174 10.26174 10.26174 10.26174  8.07743
    

    As for whether the intervals given by end points 8, 10 and 12 are left- or right-closed, see help("findInterval").

    As a final check, table the values returned by findInterval and see if they match the histogram's counts.

    table(f1)
    #f1
    #  1   2   3   4   5   6   7   8   9 
    #  2  34 130  34  17 478 512 169  24 
    h1$counts
    #[1]   2  34 130  34  17 478 512 169  24
    

    To have the intervals for each data point, the following

    bins <- data.frame(bin = f1, min = h1$breaks[f1], max = h1$breaks[f1 + 1L])
    head(bins)
    #  bin min max
    #1   6   8  10
    #2   7  10  12
    #3   7  10  12
    #4   7  10  12
    #5   7  10  12
    #6   6   8  10