
Get 2D table (6x6) for dataframe containing two continuous variables, by binning


I am trying to partition the observations in a data frame into 36 groups based on two continuous variables. More specifically, I want to cut each of the two variables into six intervals and then assign each observation to one of the 36 possible combinations.

My attempt is below, and it works. But is there a faster way to do this that avoids the nested for loops?

Also, though this isn't essential, how could I display the number of observations in each group as a 6 by 6 grid? I know table() would list the 36 possible groups and their totals, but not in grid format.

set.seed(123)
x1 <- rnorm(1000)
x2 <- rnorm(1000)
data <- data.frame(x1,x2)

labs1 <- levels(cut(x1, 6))
ints1 <- cbind(lower = as.numeric(sub("\\((.+),.*", "\\1", labs1)),
               upper = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", labs1)))
labs2 <- levels(cut(x2, 6))
ints2 <- cbind(lower = as.numeric(sub("\\((.+),.*", "\\1", labs2)),
               upper = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", labs2)))

tmp <- expand.grid(labs1, labs2)
groups <- cbind(lower1 =  as.numeric(sub("\\((.+),.*", "\\1", tmp[,1])), 
                upper1 = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", tmp[,1])), 
                lower2 = as.numeric(sub("\\((.+),.*", "\\1", tmp[,2])),
                upper2 = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", tmp[,2])))

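# Create the column first so unmatched rows stay NA
data$group <- NA_integer_

for (i in seq_len(nrow(data))){
  for (j in seq_len(nrow(groups))){
    # && is the scalar (short-circuiting) form, appropriate inside if()
    if (x1[i] >= groups[j,1] && x1[i] <= groups[j,2] &&
        x2[i] >= groups[j,3] && x2[i] <= groups[j,4]){
      data$group[i] <- j
    }
  }
}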

Solution

  • You can combine apply(), which iterates over the rows of your data.frame, with which(), which compares each row against every row of the groups matrix at once:

    data$group <- apply(data, 1, FUN=function(dataRow) 
      which(
        dataRow[1] >= groups[,1] & 
        dataRow[1] <= groups[,2] & 
        dataRow[2] >= groups[,3] & 
        dataRow[2] <= groups[,4]))
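
A fully vectorized alternative is sketched below. It is not part of the answer above, and it assumes the equal-width bins that cut(x, 6) produces are what you want. With labels = FALSE, cut() returns the bin index directly, so the bounds never have to be parsed back out of the labels. The two indices are then combined so that the numbering follows the same order as expand.grid(labs1, labs2), with the x1 bin varying fastest. The names bin1, bin2 and grid are only illustrative. Because the loop compares against bounds rounded for display, a value sitting right on a rounded boundary could end up in a different group than it does here.

```r
set.seed(123)
x1 <- rnorm(1000)
x2 <- rnorm(1000)
data <- data.frame(x1, x2)

# Bin each variable into 6 equal-width intervals; labels = FALSE returns
# the integer bin index (1..6) instead of a factor.
bin1 <- cut(data$x1, breaks = 6, labels = FALSE)
bin2 <- cut(data$x2, breaks = 6, labels = FALSE)

# Combine the two indices into a single id in 1..36, with the x1 bin
# varying fastest (the ordering expand.grid(labs1, labs2) uses).
data$group <- (bin2 - 1L) * 6L + bin1

# 6 x 6 contingency table: rows are x1 intervals, columns are x2 intervals.
# Factors keep all their levels, so empty cells show up as 0.
grid <- table(x1 = cut(data$x1, 6), x2 = cut(data$x2, 6))
print(grid)
```

Passing both factors to table() is what gives the grid layout: a single grouping variable yields a one-dimensional table, while two yield a two-way table.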