Can anyone tell me what the line of code below does?
sapply(X, function(x) sum(is.na(x))) / nrow(airports) * 100
My understanding is that it drops NAs when it applies the sum function but keeps them in the matrix.
Any help is appreciated.
Thank you
Enough comments, time for an answer:
sapply(X,                  # apply to each item of X (each column, if X is a data frame)
  function(x)              # this function:
    sum(is.na(x))          # count the NAs
) / nrow(airports) * 100   # then divide the result by the number of rows in the airports object
                           # and multiply by 100
In words, it counts the number of missing values in each column of X, then divides the result by the number of rows in airports and multiplies by 100. This calculates the percentage of missing values in each column, assuming X has the same number of rows as airports.
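To make this concrete, here is a minimal sketch on a toy data frame. The data are made up for illustration, and X is assumed to be the same object as airports, which the question does not confirm.

```r
# Hypothetical stand-in for the airports data frame from the question
airports <- data.frame(
  name      = c("A", "B", NA, "D"),      # 1 of 4 missing
  elevation = c(10, NA, NA, 40),         # 2 of 4 missing
  code      = c("AAA", "BBB", "CCC", "DDD")  # none missing
)
X <- airports  # assumption: X and airports are the same data

# Count the NAs per column, divide by the row count, scale to percent
na_pct <- sapply(X, function(x) sum(is.na(x))) / nrow(airports) * 100
print(na_pct)
#      name elevation      code
#        25        50         0
```

The result is a named numeric vector with one percentage per column.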
It's strange to mix and match the columns of X with nrow(airports). I would expect those to be the same object, that is, either sapply(airports, ...) / nrow(airports) or sapply(X, ...) / nrow(X).
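A small sketch of why the mismatch matters, using two made-up objects with different row counts (both are hypothetical, chosen only to show the effect):

```r
# Hypothetical objects with different numbers of rows
X        <- data.frame(a = c(1, NA, 3, NA))  # 4 rows, 2 missing in column a
airports <- data.frame(id = 1:8)             # 8 rows

# Dividing by the row count of a different object skews the percentage
mixed      <- sapply(X, function(x) sum(is.na(x))) / nrow(airports) * 100
consistent <- sapply(X, function(x) sum(is.na(x))) / nrow(X) * 100

print(mixed)       # a = 25, which understates the share of missing values
print(consistent)  # a = 50, the actual share of missing values in X$a
```

The mixed form is only a true percentage when both objects have the same number of rows.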
As I mentioned in comments, nothing is being "dropped". If you wanted to do a sum ignoring the NA values, you would use sum(foo, na.rm = TRUE). Here, instead, what is being summed is is.na(x). That is, we are summing whether or not each value is missing, which counts the missing values. sum(is.na(foo)) is the idiomatic way to count the number of NA values in foo.
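The difference between the two operations can be seen on a short vector (foo here is an arbitrary example vector, not data from the question):

```r
foo <- c(2, NA, 5, NA, 1)

# Summing the values themselves: NA propagates unless it is removed
plain_sum         <- sum(foo)                # NA
total_ignoring_na <- sum(foo, na.rm = TRUE)  # 2 + 5 + 1 = 8

# Summing the logical vector from is.na(): TRUE counts as 1, FALSE as 0
is.na(foo)                    # FALSE TRUE FALSE TRUE FALSE
na_count <- sum(is.na(foo))   # 2 missing values

# foo itself is untouched by any of these calls
length(foo)                   # still 5
```

Neither call modifies foo, so the NAs remain in the original data either way.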
In this case, where the goal is a percent not a count, we can simplify by using mean() instead of sum() / n:
# slightly simpler, consistent object
sapply(airports, function(x) mean(is.na(x))) * 100
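We could also use is.na() on the entire data so we don't need the "anonymous function". Since is.na() on a data frame returns a logical matrix, and sapply() over a matrix visits each element rather than each column, we take the column means with colMeans():
# rearrange for more simplicity
colMeans(is.na(airports)) * 100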
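As a quick sanity check, the different forms can be compared on a toy data frame (made-up data, standing in for airports). Note that is.na() on a data frame returns a logical matrix, so the whole-data version in this sketch uses colMeans(). Calling sapply() directly on that matrix would iterate over individual cells rather than columns.

```r
# Hypothetical stand-in for the airports data frame
airports <- data.frame(
  name      = c("A", "B", NA, "D"),
  elevation = c(10, NA, NA, 40)
)

via_sum      <- sapply(airports, function(x) sum(is.na(x))) / nrow(airports) * 100
via_mean     <- sapply(airports, function(x) mean(is.na(x))) * 100
via_colmeans <- colMeans(is.na(airports)) * 100

# All three give the same per-column percentages
stopifnot(isTRUE(all.equal(via_sum, via_mean)))
stopifnot(isTRUE(all.equal(via_mean, via_colmeans)))

# sapply() over the logical matrix returns one value per cell, not per column
elementwise <- sapply(is.na(airports), mean)
length(elementwise)  # 8 (4 rows x 2 columns)
```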