
Discretizing the log of a continuous variable


I am trying to discretize a continuous variable, cutting it into three levels. I want to do the same thing for the log of the positive continuous variable (in this case, income).

library(dplyr)
set.seed(3)
# Simulated positive "income" values
mydata <- data.frame(realinc = rexp(10000))

summary(mydata)

new <- mydata %>%
  select(realinc) %>%
  mutate(logrealinc = log(realinc),
         # cut(x, 3) requests three intervals
         realincTercile = cut(realinc, 3),
         logrealincTercile = cut(logrealinc, 3),
         # integer codes of the factor levels
         realincTercileNum = as.numeric(realincTercile),
         logrealincTercileNum = as.numeric(logrealincTercile))

new[sample(1:nrow(new), 10),]

I would have thought that using cut() would produce identical levels for the discretized factors of each of these variables (income and log income), because log is a monotone function. So the two columns on the right here should be equal, but that doesn't seem to happen. What's going on?

> new[sample(1:nrow(new), 10),]
       realinc  logrealinc  realincTercile logrealincTercile realincTercileNum logrealincTercileNum
7931 0.2967813 -1.21475972 (-0.00805,2.83]     (-4.43,-1.15]                 1                    2
9036 0.9511824 -0.05004944 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
8204 4.5365676  1.51217069     (2.83,5.66]      (-1.15,2.15]                 2                    3
3136 2.0610693  0.72322490 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
9708 0.9655805 -0.03502581 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
5942 0.9149351 -0.08890215 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
4631 0.6987581 -0.35845064 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
7309 1.9532566  0.66949804 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
7708 0.4220254 -0.86268973 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
2965 1.3690976  0.31415186 (-0.00805,2.83]      (-1.15,2.15]                 1                    3

Edit: @nicola's comment explains the source of the problem. In cut's documentation, "equal-length intervals" refers to the width of each interval on the scale of the continuous argument. I had originally read it as meaning that each level of the output would contain an equal number of elements.
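To make the equal-width behaviour concrete, here is a minimal sketch that rebuilds the kind of breaks cut(x, 3) uses and compares them across the two scales. The helper name width_breaks is hypothetical (it is not part of base R or dplyr), and it ignores the small outward nudge that cut applies to the two outer edges.

```r
set.seed(3)
realinc <- rexp(10000)
logrealinc <- log(realinc)

# Hypothetical helper: k equal-width intervals spanning the range of x.
# cut(x, k) builds its breaks the same way, then widens the two outer
# edges by 0.1% of the range so the extremes fall inside.
width_breaks <- function(x, k) seq(min(x), max(x), length.out = k + 1)

b_raw <- width_breaks(realinc, 3)
b_log <- width_breaks(logrealinc, 3)

# If the two binnings agreed, exp(b_log) would reproduce b_raw.
# Only the endpoints match; the interior breaks differ.
print(b_raw)
print(exp(b_log))

# Equal width does not mean equal counts on either scale
print(table(cut(realinc, 3)))
print(table(cut(logrealinc, 3)))
```

Equal-width breaks on the log scale are geometrically spaced on the original scale, so the interior cut points cannot coincide with the evenly spaced ones, even though log is monotone.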

Is there a function that instead puts an equal number of elements in each output level? Equivalently, one for which the levels of newfunc(realinc) and newfunc(logrealinc) are equal?


Solution

  • If you want your levels to be equally populated, take a look at the quantile function. Try for instance:

    x<-cut(new$realinc,quantile(new$realinc,0:3/3))
    y<-cut(new$logrealinc,quantile(new$logrealinc,0:3/3))
    all(as.integer(x)==as.integer(y),na.rm=TRUE)
    #[1] TRUE
    table(x)
    #x
    #(0.000444,0.396]     (0.396,1.12]      (1.12,8.49] 
    #            3333             3333             3333
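One detail worth noting: the counts above add up to 9999 rather than 10000, because cut's intervals are open on the left by default, so the minimum value falls outside the first interval and becomes NA (hence the na.rm=TRUE). A possible way around this, sketched below, is to pass include.lowest = TRUE. The wrapper name quantile_cut is hypothetical and only for illustration; it assumes the data have enough distinct values that the quantile breaks are unique.

```r
set.seed(3)
realinc <- rexp(10000)

# Hypothetical helper (not part of base R or dplyr):
# k roughly equally populated bins, returned as integer codes.
quantile_cut <- function(x, k) {
  breaks <- quantile(x, probs = (0:k) / k, na.rm = TRUE)
  # include.lowest = TRUE closes the first interval on the left,
  # so the minimum is assigned to bin 1 instead of becoming NA
  cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
}

raw_bin <- quantile_cut(realinc, 3)
log_bin <- quantile_cut(log(realinc), 3)

print(anyNA(raw_bin))          # is any value left unbinned?
print(table(raw_bin))          # bin sizes
print(mean(raw_bin == log_bin))  # share of rows with the same bin on both scales
```

Because the breaks are taken from the data's own quantiles, a monotone transform such as log moves the breaks along with the data, which is why the bin codes line up across scales. If dplyr is already loaded, its ntile() function is another option for rank-based, equally sized buckets.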