Set variable values to missing in R and drop unused levels


I have a data set, DATA, with a variable, VAR. This variable's mode is numeric and its class is factor. It represents gender. When printed, it looks something like this:

 VAR
  M
  M
  F
  U

  M

When I print its levels, I get "" "F" "M" "U", and a frequency table looks like this:

     F     M     U
 2   30    25    1

What I want to do is set everything that is not "F" or "M" to missing, relabel "M" and "F" as "Man" and "Woman", and drop the unused levels from the variable (but still keep a level for missing). So far I have the code below:

DATA$VAR[DATA$VAR == "U" | DATA$VAR == ""] <- NA

But the levels are exactly the same as before, and now the frequency table looks like this:

     F     M     U
 0   30    25    0

I feel like I'm close, but not quite there. I don't understand how to deal with the level issues. Any help is greatly appreciated.


Solution

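  • To create a factor in which everything except M and F becomes missing, pass the levels argument in a call to factor(): any value not listed in levels becomes NA, and the unlisted levels are dropped. To relabel the levels you keep, use the labels argument (the example here uses 'Male' and 'Female'; substitute 'Man' and 'Woman' as needed).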

    a <- factor(c("M", "M", "F", "U", "", "M"))

    # Keep only M and F (everything else becomes NA) and relabel them
    a2 <- factor(a, levels = c('M', 'F'), labels = c('Male', 'Female'))
    
    a2
    # [1] Male   Male   Female <NA>   <NA>   Male  
    # Levels: Male Female
    

    To include NA values in the output of table(), set useNA = 'always' (always show an NA count) or useNA = 'ifany' (show it only when NAs are present):

    table(a2, useNA = 'ifany')
    ##   a2
    ##   Male Female   <NA> 
    ##     3      1      2
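
For the data-frame case in the question, the same result can also be reached step by step, starting from the NA assignment already tried there. The following is only a sketch of one possible route. It assumes a small stand-in DATA built from the six values printed in the question (the real data is not available), and it relies on base R's droplevels(), levels<- and addNA().

```r
# Hypothetical stand-in for the asker's data; the real DATA is not shown
DATA <- data.frame(VAR = factor(c("M", "M", "F", "U", "", "M")))

# Assigning NA changes the values but leaves the set of levels untouched
DATA$VAR[DATA$VAR %in% c("U", "")] <- NA
levels(DATA$VAR)
# [1] ""  "F" "M" "U"

# droplevels() removes the levels that no longer occur in the data
DATA$VAR <- droplevels(DATA$VAR)
levels(DATA$VAR)
# [1] "F" "M"

# Relabel in place; the new names follow the current level order ("F", "M")
levels(DATA$VAR) <- c("Woman", "Man")

# addNA() turns NA into an explicit level, so missing values keep a level
DATA$VAR <- addNA(DATA$VAR)
levels(DATA$VAR)
# [1] "Woman" "Man"   NA
```

Whether to use addNA() or to leave the values as ordinary NA and pass useNA to table() is a design choice: an explicit NA level shows up in every tabulation by default, but some modelling functions treat it as a regular category.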