Replace NAs in one factor with values from another factor


There's something very basic I am missing here.

d <- data.frame(
  g0 = c("A", "B", NA, NA, "C", "C"),
  g1 = LETTERS[1:6],
  stringsAsFactors = TRUE)  # factor columns; this was the default before R 4.0.0
d
    g0 g1
1    A  A
2    B  B
3 <NA>  C
4 <NA>  D
5    C  E
6    C  F

Then I have this code, but it does not work:

d$g0[is.na(d$g0)] <- d$g1[is.na(d$g0)]

Desired result:

d
    g0 g1
1    A  A
2    B  B
3    C  C
4    D  D
5    C  E
6    C  F

Solution

  • It's always helpful to remember the original design rationale behind factors. They were intended for categorical variables that took on one of a fixed set of values. So imagine I changed your example slightly to be:

    d <- data.frame(color  = c("red", "blue", NA, NA, "green", "green"),
                    amount = c("high","low","low","mid","mid","high"),
                    stringsAsFactors = TRUE)  # factor columns; default before R 4.0.0
    
    > d
      color amount
    1   red   high
    2  blue    low
    3  <NA>    low
    4  <NA>    mid
    5 green    mid
    6 green   high
    

    Now it totally makes sense that R complains when we run the following:

    > d$color[is.na(d$color)] <- d$amount[is.na(d$color)]
    Warning message:
    In `[<-.factor`(`*tmp*`, is.na(d$color), value = c(3L, 1L, NA, NA,  :
      invalid factor level, NA generated
    

    because why would we ever want a color of "high" or "mid"? That makes no sense. The mental model here is that either two factors really have nothing to do with each other, or if they do, their levels should be the same. So,

    levels(d$color) <- c(levels(d$color),"low","mid")
    d$color[is.na(d$color)] <- d$amount[is.na(d$color)]
    

    this runs with no problems:

    > d
      color amount
    1   red   high
    2  blue    low
    3   low    low
    4   mid    mid
    5 green    mid
    6 green   high
    

    even if the result is semantically nonsensical.

    Of course, many people find all this factor level juggling irksome and would have simply done:

    d <- data.frame(color  = c("red", "blue", NA, NA, "green", "green"),
                    amount  = c("high","low","low","mid","mid","high"), 
                    stringsAsFactors = FALSE)
    

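    and then R won't care what you fill the NA values with at all, because the columns are plain character vectors rather than factors. (Since R 4.0.0, stringsAsFactors = FALSE is the default for data.frame.)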
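
Applied to the data frame from the question, the two ideas above might look like the sketch below. It assumes g0 and g1 really are factors (hence the explicit stringsAsFactors = TRUE), and the names missing, by_levels, by_chars and g0_chr are made up for illustration. The as.character() calls are a precaution so that values are matched by label rather than by internal integer code.

```r
# Rebuild the question's data frame with explicit factor columns
# (stringsAsFactors = TRUE is required for this in R >= 4.0.0).
d <- data.frame(
  g0 = c("A", "B", NA, NA, "C", "C"),
  g1 = LETTERS[1:6],
  stringsAsFactors = TRUE
)

# Rows where g0 is missing; computed once, before any replacement.
missing <- is.na(d$g0)

# Approach 1: widen the levels of g0 to the union of both level sets,
# then fill the gaps with the labels taken from g1.
by_levels <- d
levels(by_levels$g0) <- union(levels(by_levels$g0), levels(by_levels$g1))
by_levels$g0[missing] <- as.character(by_levels$g1[missing])

# Approach 2: do the replacement on character vectors,
# then convert the result back to a factor.
by_chars <- d
g0_chr <- as.character(by_chars$g0)
g0_chr[missing] <- as.character(by_chars$g1[missing])
by_chars$g0 <- factor(g0_chr)

print(by_levels)
print(by_chars)
```

Both versions should give the same visible values in g0. The difference is in the levels: the first keeps every level of g1 as a possible value of g0, while the second keeps only the labels that actually occur.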