I am converting a list of numbers to a matrix. The list of numbers is encoded text. Each word has a number associated with it, like 'the': 1, 'it': 2, etc. I want to get a matrix of values where the presence of an encoded word is represented by a '1'. So if one of our encoded texts looked like:
c(1, 4, 2)
Then the corresponding matrix (with a max word index of 10) would look like:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 1 0 1 0 0 0 0 0 0
So here is how I have it currently:
encoded.text <- list(c(1, 3, 2), c(1, 7, 8))
result <- matrix(0, nrow = length(encoded.text), ncol = 10)
for (i in 1:length(encoded.text)) {
result[i, encoded.text[[i]]] <- 1
}
I'm wondering, is there a better/more efficient way than the for loop to do this?
Here is one option with row/column
indexing. We unlist
the 'encoded.text' for the column index, while rep
licate the sequence of the list
with the lengths
of the list
as row
index. cbind
it to make a row/column index matrix, extract the values of 'result' based on the index and assign it to 1
m1 <- cbind(rep(seq_along(encoded.text), lengths(encoded.text)),
unlist(encoded.text))
result[m1] <- 1
result
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#[1,] 1 1 1 0 0 0 0 0 0 0
#[2,] 1 0 0 0 0 0 1 1 0 0
NOTE: The apply/Map
etc are just loops as for
loop. It wouldn't give any peformance increment except just to add as an answer
n <- 1e6
test <- rep(encoded.text, n)
testresult <- matrix(0, nrow = length(test), ncol = 10)
testresult2 <- copy(testresult)
testresult3 <- copy(testresult)
system.time({
m2 <- cbind(rep(seq_along(test), lengths(test)),
unlist(test))
testresult[m2] <- 1
})
# user system elapsed
# 0.290 0.098 0.388
system.time({
testresult2[do.call(rbind, Map(cbind, seq_len(length(test)), test))] <- 1
})
# user system elapsed
# 8.383 0.462 8.787
system.time({
for (i in 1:length(test)) {
testresult3[i, test[[i]]] <- 1
}
})
# user system elapsed
# 0.648 0.131 0.778
If we increase 'n' and rerun again (after constructing the data)
n <- 1e7
system.time({
m2 <- cbind(rep(seq_along(test), lengths(test)),
unlist(test))
testresult[m2] <- 1
})
# user system elapsed
# 2.699 1.225 3.990 # almost 2 times efficient now
system.time({
testresult2[do.call(rbind, Map(cbind, seq_len(length(test)), test))] <- 1
})
# user system elapsed
# 88.584 5.047 94.384
system.time({
for (i in 1:length(test)) {
testresult3[i, test[[i]]] <- 1
}
})
# user system elapsed
# 5.734 0.742 6.461
-microbenchmark on n <- 1e7
constructed data
ak <- function() {
m2 <- cbind(rep(seq_along(test), lengths(test)),
unlist(test))
testresult[m2] <- 1
}
wfw <- function() {
for (i in 1:length(test)) {
testresult3[i, test[[i]]] <- 1
}
}
library(microbemchmark)
microbenchmark(ak(), wfw(), unit = 'relative', times = 20L)
#Unit: relative
# expr min lq mean median uq max neval cld
# ak() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 20 a
# wfw() 1.946415 1.945528 1.927263 1.926645 1.910907 1.940207 20 b