R dividing dataset into ranged bins?


I am having trouble sorting my dataset into bins based on the numeric value of each data point. I tried the shingle function from the lattice package, which seems to split the data accurately.

I can't seem to extract the desired output, namely how the data is divided into the predefined bins. I only seem able to print it.

bin_interval = matrix(c(0.38,0.42,0.46,0.50,0.54,0.58,0.62,0.66,0.70,0.74,0.78,0.82,0.86,0.90,0.94,0.98,
                        0.40,0.44,0.48,0.52,0.56,0.60,0.64,0.68,0.72,0.76,0.80,0.84,0.88,0.92,0.96,1.0),
                        ncol = 2, nrow =  16)
bin_1 = shingle(data_1,intervals = bin_interval)

How do I extract the intervals that the shingle function outputs, rather than only printing them?

The intervals are this part of the output:

Intervals:
    min  max count
1  0.38 0.40     0
2  0.42 0.44     6
3  0.46 0.48    46
4  0.50 0.52   251
5  0.54 0.56   697
6  0.58 0.60  1062
7  0.62 0.64  1215
8  0.66 0.68  1227
9  0.70 0.72  1231
10 0.74 0.76  1293
11 0.78 0.80  1330
12 0.82 0.84  1739
13 0.86 0.88  2454
14 0.90 0.92  3048
15 0.94 0.96  8936
16 0.98 1.00 71446

I want them as a variable that can be fed to another function.


Solution

  • The shingle() function stores the intervals as an attribute of the object it returns; you can list all of them with attributes().

    The levels are specifically given by attr(bin_1,"levels").

    So:

    library(lattice)  # provides shingle()
    set.seed(1337)
    data_1 = runif(100)
    
    bin_interval = matrix(c(0.38,0.42,0.46,0.50,0.54,0.58,0.62,0.66,0.70,0.74,0.78,0.82,0.86,0.90,0.94,0.98,
                            0.40,0.44,0.48,0.52,0.56,0.60,0.64,0.68,0.72,0.76,0.80,0.84,0.88,0.92,0.96,1.0),
                            ncol = 2, nrow =  16)
    bin_1 = shingle(data_1,intervals = bin_interval)
    
    attr(bin_1,"levels")
    

    This gives:

          [,1] [,2]
     [1,] 0.38 0.40
     [2,] 0.42 0.44
     [3,] 0.46 0.48
     [4,] 0.50 0.52
     [5,] 0.54 0.56
     [6,] 0.58 0.60
     [7,] 0.62 0.64
     [8,] 0.66 0.68
     [9,] 0.70 0.72
    [10,] 0.74 0.76
    [11,] 0.78 0.80
    [12,] 0.82 0.84
    [13,] 0.86 0.88
    [14,] 0.90 0.92
    [15,] 0.94 0.96
    [16,] 0.98 1.00
    

    Edit

    The count for each interval is not stored in the object; it is computed only inside the print.shingle method. To get it as a variable, you can reuse that logic in your own function:

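    # Rebuild the interval table that print.shingle computes but does not return
    count.shingle = function(x){
      l <- levels(x)    # each l[[i]] holds c(min, max) for one interval
      n <- nlevels(x)
      int <- data.frame(min = numeric(n), max = numeric(n), 
                        count = numeric(n))
      for (i in seq_len(n)) {
        int$min[i] <- l[[i]][1]
        int$max[i] <- l[[i]][2]
        # number of values inside the closed interval [min, max]
        int$count[i] <- length(x[x >= l[[i]][1] & x <= l[[i]][2]])
      }
    
      int
    }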
    
    a = count.shingle(bin_1)
    

    This gives:

    > a 
       min  max count
    1  0.38 0.40     0
    2  0.42 0.44     1
    3  0.46 0.48     3
    4  0.50 0.52     1
    5  0.54 0.56     2
    6  0.58 0.60     2
    7  0.62 0.64     2
    8  0.66 0.68     4
    9  0.70 0.72     1
    10 0.74 0.76     3
    11 0.78 0.80     2
    12 0.82 0.84     2
    13 0.86 0.88     5
    14 0.90 0.92     1
    15 0.94 0.96     1
    16 0.98 1.00     2
    

    where a$min is the lower bound, a$max is the upper bound, and a$count is the number of values within each bin.
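
    If only the counts are needed, the same table can also be built directly from the interval matrix in base R, without going through the shingle object. The following is a minimal sketch, not part of lattice: count_intervals is a hypothetical helper name, and it assumes the intervals are closed on both ends, as in the loop above.

```r
# Hypothetical helper (not part of lattice): for each row of a two-column
# matrix of intervals, count the values of x inside the closed range [min, max].
count_intervals <- function(x, intervals) {
  counts <- vapply(
    seq_len(nrow(intervals)),
    function(i) sum(x >= intervals[i, 1] & x <= intervals[i, 2]),
    integer(1)
  )
  data.frame(min = intervals[, 1], max = intervals[, 2], count = counts)
}

# Same sample data and interval matrix as in the example above
set.seed(1337)
data_1 <- runif(100)
bin_interval <- matrix(c(0.38,0.42,0.46,0.50,0.54,0.58,0.62,0.66,0.70,0.74,0.78,0.82,0.86,0.90,0.94,0.98,
                         0.40,0.44,0.48,0.52,0.56,0.60,0.64,0.68,0.72,0.76,0.80,0.84,0.88,0.92,0.96,1.0),
                       ncol = 2, nrow = 16)

a <- count_intervals(data_1, bin_interval)
```

    As before, a is an ordinary data frame, so a$count (or the whole table) can be passed to another function.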