
Using is.na with the sapply function in R


Can anyone tell me what the line of code below does?

sapply(X, function(x) sum(is.na(x))) / nrow(airports) * 100

My understanding is that it drops the NAs when it applies the sum function but keeps them in the matrix.

Any help is appreciated.

Thank you


Solution

  • Enough comments, time for an answer:

    sapply(X,      # apply to each item of X (each column, if X is a data frame)
      function(x)  # this function:
        sum(is.na(x))  # count the NAs
    ) / nrow(airports) * 100  # then divide the result by the number of rows in the airports object
      # and multiply by 100
    

    In words, it counts the number of missing values in each column of X, then divides the result by the number of rows in airports and multiplies by 100. This calculates the percentage of missing values in each column, assuming X has the same number of rows as airports.
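
    To see the calculation on concrete data, here is a minimal sketch using a small made-up data frame. The name toy and its values are hypothetical stand-ins for airports, not the asker's data.

```r
# Hypothetical stand-in for `airports`: 4 rows, 3 columns, some values missing
toy <- data.frame(
  name = c("A", NA, "C", "D"),
  lat  = c(1.5, NA, NA, 4.0),
  lon  = c(10, 20, 30, 40)
)

# Count the NAs in each column
na_counts <- sapply(toy, function(x) sum(is.na(x)))

# Convert the counts to a percentage of the number of rows
na_pct <- na_counts / nrow(toy) * 100
print(na_pct)
```

    Here name has one NA out of four rows (25%), lat has two (50%), and lon has none (0%).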

    It's strange to mix and match the columns of X with nrow(airports). I would expect those to be the same object, that is, either sapply(airports, ...) / nrow(airports) or sapply(X, ...) / nrow(X).

    As I mentioned in the comments, nothing is being "dropped". If you wanted to do a sum ignoring the NA values, you would use sum(foo, na.rm = TRUE). Here, instead, what is being summed is is.na(x); that is, we are summing whether or not each value is missing, which counts the missing values. sum(is.na(foo)) is the idiomatic way to count the number of NA values in foo.
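
    The difference between the two calls can be sketched on a short vector. The vector foo below is a made-up example chosen only for illustration.

```r
# Hypothetical vector with two missing values
foo <- c(3, NA, 5, NA, 2)

# Adds the non-missing values (3 + 5 + 2), skipping the NAs
total_ignoring_na <- sum(foo, na.rm = TRUE)

# is.na(foo) is a logical vector; sum() treats TRUE as 1 and FALSE as 0,
# so this counts the missing values
na_count <- sum(is.na(foo))

# Without na.rm = TRUE, the NA propagates through sum()
total_with_na <- sum(foo)

print(total_ignoring_na)
print(na_count)
print(total_with_na)
```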

    In this case, where the goal is a percentage, not a count, we can simplify by using mean() instead of sum() / n, since the mean of a logical vector is the proportion of TRUE values:

    # slightly simpler, consistent object
    sapply(airports, function(x) mean(is.na(x))) * 100
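
    As a quick check that the two forms agree, here is a small sketch on a single made-up column. The vector x is hypothetical.

```r
# Hypothetical column with 2 NAs out of 8 values
x <- c(1, NA, 3, 4, NA, 6, 7, 8)

# Count of NAs divided by the number of values, as a percentage
pct_via_sum <- sum(is.na(x)) / length(x) * 100

# Mean of the logical vector is the proportion of TRUE values
pct_via_mean <- mean(is.na(x)) * 100

print(c(pct_via_sum, pct_via_mean))
```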
    

    We could also use is.na() on the entire data so we don't need the "anonymous function":

    # rearrange for more simplicity: is.na() on a data frame returns a
    # logical matrix, so take the column means directly
    colMeans(is.na(airports)) * 100
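
    One detail worth checking when calling is.na() on a whole data frame: the result is a logical matrix rather than a data frame, so a column-wise helper such as colMeans() gives one value per column, whereas sapply() over a matrix visits each cell separately. A small sketch, again using a hypothetical toy data frame in place of airports:

```r
# Hypothetical stand-in for `airports`
toy <- data.frame(
  lat = c(1.5, NA, NA, 4.0),
  lon = c(10, 20, 30, 40)
)

# is.na() on a data frame returns a logical matrix of the same dimensions
na_matrix <- is.na(toy)
print(is.matrix(na_matrix))
print(dim(na_matrix))

# Column means of the logical matrix: one percentage per column
pct_by_col <- colMeans(na_matrix) * 100
print(pct_by_col)

# sapply() over a matrix iterates over individual cells, not columns
per_cell <- sapply(na_matrix, mean)
print(length(per_cell))
```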