I am trying to remove NAs in R. I have tried to replicate a simple example I have found in multiple places online, but I am getting unexpected output. I cannot find the error by searching online. What am I doing wrong?
I am using R version 4.3.2. I have restarted R and cleared the global variables (and restarted R again) and consistently get this result with anything I try.
a <- c(1, 2, NA, 3, 4, NA, 5, 6)
b <- na.omit(a)
b
The output is
[1] 1 2 3 4 5 6
attr(,"na.action")
[1] 3 6
attr(,"class")
[1] "omit"
I was expecting to get the output 1 2 3 4 5 6.
I have found I can instead use b <- a[!(is.na(a))], but I am curious why the commonly suggested na.omit does not work.
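You do get the intended values in the output. What I think you are misunderstanding is that the attr(,"na.action") and attr(,"class") lines are simply attributes printed along with the numeric vector of six non-NA numbers. The na.action attribute records the positions of the removed values (3 and 6), and the class "omit" belongs to that attribute rather than to the data. The values themselves are unaffected. If you do b + 1, you'll get the values incremented: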
b + 1
# [1] 2 3 4 5 6 7
# attr(,"na.action")
# [1] 3 6
# attr(,"class")
# [1] "omit"
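To confirm that the extra attributes do not interfere with ordinary computations, a quick check along these lines may help. This is a minimal sketch in base R that only assumes the vectors a and b from the question.

```r
a <- c(1, 2, NA, 3, 4, NA, 5, 6)
b <- na.omit(a)

# The data are still a plain numeric vector of the six non-NA values
length(b)       # 6
sum(b)          # 21
mean(b)         # 3.5
is.numeric(b)   # TRUE

# The only extra attribute on b is "na.action"; the "omit" class
# sits on that attribute, not on b itself
names(attributes(b))          # "na.action"
class(attr(b, "na.action"))   # "omit"
```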
If you really want to use na.omit and remove the attributes, you can do:
attributes(b) <- NULL
b
# [1] 1 2 3 4 5 6
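As an alternative sketch, as.vector() drops all attributes of an atomic vector, so it can strip the na.omit bookkeeping in one step without modifying b in place. The name b_plain is only an illustrative choice.

```r
a <- c(1, 2, NA, 3, 4, NA, 5, 6)

# as.vector() removes all attributes from an atomic vector,
# including the "na.action" attribute added by na.omit()
b_plain <- as.vector(na.omit(a))
b_plain
attributes(b_plain)   # NULL

# Same result as logical indexing
identical(b_plain, a[!is.na(a)])   # TRUE
```

Note that as.vector() would also drop names if the vector had any, so it is only a drop-in replacement when no other attributes need to be kept.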
Ultimately, though, a[!is.na(a)] is much faster and should still be safe. Look at the `itr/sec` field to see that a[!is.na(a)] is ~10x faster on this small vector.
# Compare logical indexing with na.omit(), with and without its attributes
bench::mark(
  isna = a[!is.na(a)],
  omit = na.omit(a),
  omit_no_attr = `attributes<-`(na.omit(a), NULL),
  check = FALSE)
# # A tibble: 3 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 isna 311.88ns 325.96ns 2319673. NA 0 10000 0 4.31ms <NULL> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>
# 2 omit 2.8µs 3.29µs 236026. NA 53.8 4389 1 18.59ms <NULL> <NULL> <bench_tm [4,390]> <tibble [4,390 × 3]>
# 3 omit_no_attr 2.91µs 3.38µs 286354. NA 0 10000 0 34.92ms <NULL> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>
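The check = FALSE argument is presumably there because bench::mark() by default verifies that all expressions return equal results, and the na.omit(a) result differs from the others in its attributes. A small base-R sketch of that difference follows (the names x_isna and x_omit are illustrative):

```r
a <- c(1, 2, NA, 3, 4, NA, 5, 6)
x_isna <- a[!is.na(a)]
x_omit <- na.omit(a)

# Not identical: x_omit carries the "na.action" attribute
identical(x_isna, x_omit)   # FALSE

# Equal once attributes are ignored
isTRUE(all.equal(x_isna, x_omit, check.attributes = FALSE))   # TRUE
```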
Even on a medium-large vector, it's still faster:
a_medium <- rep(a, 1000)
bench::mark(isna = a_medium[!is.na(a_medium)], omit = na.omit(a_medium), omit_no_attr = `attributes<-`(na.omit(a_medium), NULL), check = FALSE)
# # A tibble: 3 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 isna 16.2µs 18.3µs 53627. NA 5.36 9999 1 186ms <NULL> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>
# 2 omit 29.4µs 33.4µs 29944. NA 0 10000 0 334ms <NULL> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>
# 3 omit_no_attr 29.5µs 33.7µs 29215. NA 2.92 9999 1 342ms <NULL> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>
But if it gets a lot larger, the relative gap narrows (to roughly 1.5x on the median here):
a_big <- rep(a, 100000)
bench::mark(isna = a_big[!is.na(a_big)], omit = na.omit(a_big), omit_no_attr = `attributes<-`(na.omit(a_big), NULL), check = FALSE)
# # A tibble: 3 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 isna 2.03ms 2.19ms 452. NA 2.10 215 1 475ms <NULL> <NULL> <bench_tm [216]> <tibble [216 × 3]>
# 2 omit 3.08ms 3.3ms 259. NA 2.05 126 1 487ms <NULL> <NULL> <bench_tm [127]> <tibble [127 × 3]>
# 3 omit_no_attr 3.1ms 3.27ms 302. NA 2.05 147 1 487ms <NULL> <NULL> <bench_tm [148]> <tibble [148 × 3]>
But since we're talking on the order of 2-3 ms for a vector of length 800,000, the juice might not be worth the squeeze.