I have got a tibble
of more than 2 million rows. One of the columns size
is a value using M to represent million, k to represent thousand; it also has some <NA>
values. The column type is character
, like the following:
size
1.3M
5k
302
8.6M
<NA>
4.4k
21
...and so on.
I tried the following code:
for (i in 1:length(example$size)) {
if (!is.na(example$size[i])) {
if (str_sub(example$size[i],-1,-1) == "M") {
example$size[i] = as.numeric(str_sub(example$size[i], 1,-2)) * 1000000
} else if (str_sub(example$size[i],-1,-1) == "k") {
example$size[i] = as.numeric(str_sub(example$size[i], 1,-2)) * 1000
}
}
}
But it took more than half hour and still running, so I interrupted that as I was not sure if my code was wrong and it's in a infinite loop. Is there anything wrong or any way of coding to improve the efficiency?
You could accomplish everything using str_replace_all
and as.numeric
:
as.numeric(stringr::str_replace_all(size, c(M = "e6", k = "e3")))
[1] 1300000 5000 NA 21 4400
Edit:
A faster approach is to use baseR sub
function twice:
as.numeric(sub("k", "e3", sub("M", "e6", bigsize,fixed = TRUE), fixed = TRUE))
A quick microbenchmark check shows that this method is the fastest:
microbenchmark::microbenchmark(
a = as.numeric(sub("k", "e3", sub("M", "e6", bigsize,fixed = TRUE), fixed = TRUE)),
b = as.numeric(str_replace_all(bigsize, c(M = "e6", k = "e3"))),
rep1 = rep1(bigsize),
rep2 = rep2(bigsize),
rep3 = rep3(bigsize),
rep4 = rep4(bigsize),
rep5 = rep5(bigsize_df), times=3)
Unit: milliseconds
expr min lq mean median uq max neval
a 621.1582 638.9055 664.4689 656.6529 686.1242 715.5955 3
b 1102.8758 1108.1215 1118.1558 1113.3673 1125.7958 1138.2244 3
rep1 1450.3998 1478.7379 1547.1752 1507.0761 1595.5629 1684.0497 3
rep2 6144.4160 6419.0407 8411.8940 6693.6654 9545.6329 12397.6005 3
rep3 19224.9825 19225.2984 19427.0457 19225.6143 19528.0773 19830.5402 3
rep4 1188.0552 1310.4584 1368.6480 1432.8616 1458.9444 1485.0273 3
rep5 3056.1525 3177.7098 3672.9781 3299.2671 3981.3909 4663.5148 3