r vector na missing-data

How to use is.na to identify NA, " ", "" etc


I have 2 problems:

Problem 1: I am trying to work out how to identify any common missing value formats like NA, " ", "".

I thought is.na would identify all of these formats. Can someone point me in the right direction for what I need to do here?

Problem 2: I need to count the NA, " " and "" values and list the position for all of them.

I've tried:

```{r, echo=TRUE,include=TRUE}
sum(is.na(DF))
which(is.na(DF))
```

but it only counts the NA values (16) and gives me their positions.

However, I also know there are 10 missing values in my dataset whose format isn't NA but " ", so the total number of missing values should be 26, and I should get the positions of all of them.

I tried using something like:

    sum(is.na(DF, na.strings=c("NA"," ","")))

But I got this error: Error in is.na(DF, na.strings = c("NA", " ", "")) : 2 arguments passed to 'is.na' which requires 1

Any ideas on what to do here would be amazing as well.

Thank you!


Solution

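  • is.na only detects NA values, not " " or "". It also takes a single argument: na.strings belongs to read.table and read.csv, not to is.na, which is why your second attempt fails. You can convert " " and "" to NA using gsub, and then use is.na: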

    v = c(NA, "", " ", "A")
    gsub("^$|^ $", NA, v)
    # [1] NA  NA  NA  "A"
    
    sum(is.na(gsub("^$|^ $", NA, v)))
    # [1] 3
    
    which(is.na(gsub("^$|^ $", NA, v)))
    # [1] 1 2 3
    

    Explanation: ^$ matches the empty string (^ anchors the beginning of the string and $ the end). ^ $ matches a string consisting of exactly one space (with the anchors serving the same purpose), and | is the OR operator. Strings made of two or more spaces are not matched by this pattern.
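
The example above works on a vector, while the question uses a data frame. As a rough sketch of how the same idea might be applied column by column, the code below treats NA, "" and any whitespace-only string as missing. The sample DF and the names is_missing and miss are made up for illustration and are not from the question; trimws is used here in place of the gsub pattern so that strings with several spaces are also caught.

```r
# Hypothetical data frame standing in for the asker's DF
DF <- data.frame(
  a = c("x", NA, " ", ""),
  b = c("", "y", NA, "z"),
  stringsAsFactors = FALSE
)

# TRUE where a value is NA, empty, or whitespace-only
# (trimws(NA) is NA, and TRUE | NA evaluates to TRUE)
is_missing <- function(x) is.na(x) | trimws(x) == ""

# Logical matrix with the same shape as DF
miss <- sapply(DF, is_missing)

sum(miss)                     # total number of missing values
which(miss)                   # positions, counted column by column
which(miss, arr.ind = TRUE)   # the same positions as row/column pairs
```

which(miss) counts positions the same way which(is.na(DF)) does, so the two results can be compared directly. If row and column numbers are easier to read, the arr.ind = TRUE form may be more convenient.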