How are missings represented in R?


Beforehand

The most obvious answer to the title question is that missing values are represented by NA in R. Some dummy data:

x <- c("a", "NA", "<NA>", NA)

We can convert every element of x, including the missing one, to an ordinary string using x_paste0 <- paste0(x). After doing so, the second and fourth elements are the same (both are the string "NA"), and to my knowledge this is why there is no way to back-transform x_paste0 into x.
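A minimal sketch of why this conversion cannot be undone, using only base R: after paste0, the former missing value and the literal string "NA" are indistinguishable.

```r
x <- c("a", "NA", "<NA>", NA)
x_paste0 <- paste0(x)

# The missing value has become the ordinary string "NA"
x_paste0
# [1] "a"    "NA"   "<NA>" "NA"

# Elements 2 and 4 are now identical, and nothing is flagged as missing
identical(x_paste0[2], x_paste0[4])  # TRUE
any(is.na(x_paste0))                 # FALSE

# In the original vector, only the fourth element is missing
which(is.na(x))                      # 4
```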

addNA

But working with addNA suggests that it is not just NA itself that represents missing values. In x, only the last element is missing. Let's transform the vector:

x_new <- addNA(x)
x_new
[1] a    NA   <NA> <NA>
Levels: <NA> a NA <NA>

Interestingly, the fourth element, i.e. the missing value, is shown as <NA> and not as NA. Moreover, the fourth element now looks the same as the third. We are also told that there are no missing values, because any(is.na(x_new)) returns FALSE. At this point I would have thought that the information about which element is missing (the third or the fourth) is simply lost, as it was in x_paste0. But this is not true, because we can actually back-transform x_new. See:

as.character(x_new)
[1] "a"    "NA"   "<NA>" NA

How does as.character know that the third element is "<NA>" and the fourth is an actual missing value, i.e. NA?


Solution

  • That's probably a shortcoming of the base:::print.factor() method.

    x <- c("a", "NA", "<NA>", NA)
    
    addNA(x)
    # [1] a    NA   <NA> <NA>
    # Levels: <NA> a NA <NA>
    

    But:

    levels(addNA(x))
    # [1] "<NA>" "a"    "NA"   NA    
    

    So, there are no duplicated levels: the fourth level is a real NA, which is merely printed as <NA>.
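To make the mechanism explicit, here is a small sketch in base R. It relies only on the fact that a factor stores integer codes plus a levels attribute: as.character() essentially looks each code up in the levels, so the fourth element maps to the level that is a real NA rather than to the string "<NA>". The order of the levels (and hence the codes) may depend on the locale, so no output is shown for them.

```r
x <- c("a", "NA", "<NA>", NA)
x_new <- addNA(x)

# Integer codes stored in the factor: none of them is NA,
# which is why any(is.na(x_new)) is FALSE
codes <- as.integer(x_new)
any(is.na(codes))                  # FALSE

# Exactly one level is a real NA; "<NA>" and "NA" are ordinary strings
lev <- levels(x_new)
sum(is.na(lev))                    # 1
anyDuplicated(lev)                 # 0, i.e. no duplicated levels

# Looking the codes up in the levels recovers the original vector,
# which is essentially what as.character() does for a factor
identical(lev[codes], x)           # TRUE
identical(as.character(x_new), x)  # TRUE

# quote = TRUE is an argument of print.factor(); quoting the strings
# should make the string "<NA>" visually distinct from the real NA
print(x_new, quote = TRUE)
```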