Discrete bins using cut()


I want to plot the data in my data frame age.model with lattice's xyplot(), grouped by discrete bins of the column StartAge.

I am using the following code:

# set up boundaries for intervals/bins
breaks <- c(0,3,4,5,6,8,13,15,17,18,19,20,22)
# specify interval/bin labels
labels <- c("<3", "3-4)", "4-5)","5-6)", "6-8)","8-13)", "13-15)","15-17)","17-18)","18-19)","19-20)",">=20")
# bucketing data points into bins
bins <- cut(age.model$StartAge, breaks, include.lowest = TRUE, right = FALSE, labels = labels)
# inspect bins
summary(bins)

As cut()'s first argument, I have specified the column I want to discretize. However, the factor that is returned is a standalone vector and does not include the rest of the data frame. How can I bin the data while keeping the whole data frame?

Reproducible using dput:

structure(list(Height = c(0.207224416925809, -1.19429150954007, 
0.0247585682642494, 0.023546515879641, 1.51423735121426, -1.09376538778425, 
-0.125209484617016, -0.63639210765747, 0.305071992864995, -0.422021082477656
), Weight = c(-0.366133564723644, -1.06969961340686, -0.0793604259237282, 
-0.708230200986797, 1.71593234004357, -0.685215310472794, -1.20353653394014, 
-0.490399232488568, 0.742874184424376, -0.331519044995803), Training = c(19, 
27, 27, 24, 35, 23, 15, 14, 47, 7), StartAge = c(13, 19, 20, 
20, 14, 2, 8, 4, 17, 18)), row.names = c("1", "2", "3", "4", 
"5", "6", "7", "8", "9", "10"), class = "data.frame")

Solution

  • If you're using xyplot() to explore your data, consider using equal.count() or shingle(). Having some (clueless) fun with your data: the approximately linear relationship between Weight and Height does not appear to hold for the lower StartAge bins, as shown in the first example.

    # Starting with data in age.model
      library(lattice)
      xyplot(Weight ~ Height | equal.count(StartAge), age.model, type = c("p", "r"))
    

    By default, equal.count() creates 6 bins. The number is easy to change to explore other groupings:

    # Create four groups of equal counts to explore
      xyplot(Weight ~ Height | equal.count(StartAge, 4), age.model, type = c("p", "r"))
    

    The shingle() function lets you define the bins yourself, including overlapping ones, as shown here.

    # Create three groups that overlap each other
      bins <- cbind(lower = c(0,8,16), upper = c(13,18,24))
      xyplot(Weight ~ Height | shingle(StartAge, bins), age.model, type = c("p", "r"))
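To keep the cut() bins from the question together with the rest of the data frame, one option is to store the factor as a new column and condition on that column. The following is a minimal sketch, not necessarily what the answer's author had in mind: the column name AgeBin is a hypothetical choice, and the Height and Weight values are rounded copies of the dput output to keep the example short.

```r
library(lattice)

# Data from the question (Height and Weight rounded to 3 decimals for brevity)
age.model <- data.frame(
  Height   = c(0.207, -1.194, 0.025, 0.024, 1.514, -1.094, -0.125, -0.636, 0.305, -0.422),
  Weight   = c(-0.366, -1.070, -0.079, -0.708, 1.716, -0.685, -1.204, -0.490, 0.743, -0.332),
  Training = c(19, 27, 27, 24, 35, 23, 15, 14, 47, 7),
  StartAge = c(13, 19, 20, 20, 14, 2, 8, 4, 17, 18)
)

# Same boundaries and labels as in the question
breaks <- c(0, 3, 4, 5, 6, 8, 13, 15, 17, 18, 19, 20, 22)
labels <- c("<3", "3-4)", "4-5)", "5-6)", "6-8)", "8-13)", "13-15)",
            "15-17)", "17-18)", "18-19)", "19-20)", ">=20")

# cut() returns one factor value per row, so it can be stored as a column.
# AgeBin is a hypothetical column name chosen for this sketch.
age.model$AgeBin <- cut(age.model$StartAge, breaks = breaks, labels = labels,
                        right = FALSE, include.lowest = TRUE)

# Count of rows per bin (empty bins show up as 0)
bin_counts <- table(age.model$AgeBin)

# Condition the plot on the new column; p is a trellis object
p <- xyplot(Weight ~ Height | AgeBin, data = age.model, type = "p")
```

Calling print(p), or typing p at the console, draws one panel per non-empty bin. Any StartAge value outside the range covered by breaks (here 0 to 22) would become NA in AgeBin, so the boundaries may need widening for the full data set.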