
Using cut() to make a factor


I am trying to make a factor variable out of a numeric variable in R. I would like to keep track of NAs and of the new bins I am creating. Some of the bins cover a valid range and some do not. I care about the valid bins themselves, but I want an "Invalid" level that holds anything falling outside a designated range.

Here is an example:

library(reshape)

fac <- c(-1, 1, 2, 3, 4, 100, NA)
fac <- cut(fac, c(-Inf, 1, 2, 3, Inf))
fac <- addNA(fac)
combine_factor(fac, 
           variable=order(levels(fac))[c(2,3,5)],
           other.label = "Invalid")

This gives output whose levels are the intervals I want, NA, or "Invalid".

The trouble is that I do not want to select the levels by position, because I have several data sets and not all of them contain every level of the factor.

If I change the variable so that it does not contain any of a certain level of the factor:

fac <- c(-1, 1, 3, 4, 100, NA)

I keep getting the error:

Error in factor(nvar[as.numeric(fac)], labels=c(levels(fac)[variable], : invalid 'labels'; length 4 should be 1 or 3.

Output 1 (which works because I have no levels occurring 0 times):

[1] (1,2]   (1,2]   (2,3]   <NA>    Invalid Invalid Invalid
Levels: (1,2] (2,3] <NA> Invalid

Output 2 (where one level, (1,2], has 0 occurrences):

[1]   (2,3]   <NA>    Invalid Invalid Invalid 
Levels: (1,2] (2,3] <NA> Invalid

The second scenario is where I experience the error.

Is there any way I can get around this error?


Solution

  • I don't know much about the combine_factor function, but it is fairly easy to write your own.

    Here's a basic example:

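    NewLevs <- function(fac, keep, others = "Invalid") {
      # keep: integer positions (within levels(fac)) of the levels to retain
      lf <- levels(fac)
      # Named list: each name is a new level, each value the old level it replaces.
      # Kept levels map to themselves; every other level maps to `others`.
      nl <- c(setNames(as.list(lf[keep]), lf[keep]),
        setNames(as.list(lf[-keep]), rep(others, length(lf)-length(keep))))
      # Assigning a list to levels() renames levels and merges duplicate names.
      levels(fac) <- nl
      fac
    }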
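
NewLevs() relies on the list form of the levels() replacement function, where each list name is a new level and each element holds the old level(s) that should map to it. A minimal illustration on made-up data (the fruit/veg labels are only for demonstration and do not come from the question):

```r
# Toy factor; the labels here are illustrative only.
f <- factor(c("apple", "salad", "orange", "apple"))

# List form of levels<-: name = new level, value = old level(s) merged into it.
levels(f) <- list(fruit = c("apple", "orange"), veg = "salad")

levels(f)        # "fruit" "veg"
as.character(f)  # "fruit" "veg" "fruit" "fruit"
```

In NewLevs() the same effect comes from repeating the name given in `others` once for every level that is not kept.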
    

    Here's some sample data:

    fac1 <- c(-1, 1, 2, 3, 4, 100, NA)
    fac1 <- addNA(cut(fac1, c(-Inf, 1, 2, 3, Inf)))
    
    fac2 <- c(-1, 1, 3, 4, 100, NA)
    fac2 <- addNA(cut(fac2, c(-Inf, 1, 2, 3, Inf)))
    

    Put the function to work:

    fac1
    # [1] (-Inf,1] (-Inf,1] (1,2]    (2,3]    (3, Inf] (3, Inf] <NA>    
    # Levels: (-Inf,1] (1,2] (2,3] (3, Inf] <NA>
    NewLevs(fac1, c(2, 3, 5))
    # [1] Invalid Invalid (1,2]   (2,3]   Invalid Invalid <NA>   
    # Levels: (1,2] (2,3] <NA> Invalid
    
    
    fac2
    # [1] (-Inf,1] (-Inf,1] (2,3]    (3, Inf] (3, Inf] <NA>    
    # Levels: (-Inf,1] (1,2] (2,3] (3, Inf] <NA>
    NewLevs(fac2, c(2, 3, 5))
    # [1] Invalid Invalid (2,3]   Invalid Invalid <NA>   
    # Levels: (1,2] (2,3] <NA> Invalid
    

    Both the levels to keep and the label for the unwanted levels can be changed. Here the NA level is not among the kept positions, so it is relabelled as well:

    NewLevs(fac2, c(1, 2, 3), "Wrong")
    # [1] (-Inf,1] (-Inf,1] (2,3]    Wrong    Wrong    Wrong   
    # Levels: (-Inf,1] (1,2] (2,3] Wrong
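
Since the question asks to avoid selecting levels by number, the same idea can be sketched with labels instead of positions. `collapse_by_label` is a hypothetical helper, not part of the answer above or of any package. It assumes the interval labels produced by cut() are known in advance, and it rebuilds the factor with factor() rather than using the list form of levels().

```r
# Hypothetical helper: keep the levels whose labels appear in `keep`
# and fold everything else into a single `others` level.
collapse_by_label <- function(fac, keep, others = "Invalid") {
  lf <- levels(fac)
  kept <- lf[lf %in% keep]          # requested labels that exist, in level order
  x <- as.character(fac)            # values of an addNA() level come back as NA
  x[!(x %in% keep)] <- others       # %in% matches NA to NA, so NA can be kept
  # exclude = NULL lets NA remain a level when it is listed in `keep`
  factor(x, levels = c(kept, others), exclude = NULL)
}

# Same data as fac2 above: no observation falls in (1,2].
fac2 <- addNA(cut(c(-1, 1, 3, 4, 100, NA), c(-Inf, 1, 2, 3, Inf)))
res <- collapse_by_label(fac2, c("(1,2]", "(2,3]", NA))

levels(res)        # "(1,2]" "(2,3]" NA "Invalid"
as.character(res)  # "Invalid" "Invalid" "(2,3]" "Invalid" "Invalid" NA
```

Because cut() creates every interval from the fixed breaks even when no value falls into it, the empty (1,2] level is still present, and the resulting levels stay the same across data sets. If NA is left out of `keep`, missing values are folded into the `others` level.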