I have a large dataset in memory, with approximately 400k rows. Working on a subset of this data frame, I'd like to generate a large image and set elements in that image equal to a particular value, based on entries in the data frame. I've done this very simply, and undoubtedly stupidly, using a for loop:
library(Matrix)
library(data.table)  # saveMe carries class data.table
# saveMe is a subset of the data frame containing the x-ranges I want
# in columns 1-2, y-ranges in columns 3-4, and values in column 5.
saveMe <- structure(list(
  XMin = c(1, 17, 19, 19, 21, 29, 29, 31, 31, 31, 31, 33, 33, 35, 37, 39, 39, 39, 41, 43),
  XMax = c(9, 15, 1, 3, 1, 17, 37, 5, 13, 25, 35, 17, 43, 23, 47, 25, 25, 33, 21, 29),
  YMin = c(225, 305, 435, 481, 209, 1591, 157, 115, 1, 691, 79, 47, 893, 1805, 809, 949, 2179, 1733, 339, 739),
  YMax = c(277, 315, 435, 499, 213, 1689, 217, 133, 1, 707, 111, 33, 903, 1827, 849, 973, 2225, 1723, 341, 765),
  Value = c(3, 1, 0, 1, 1, 4, 3, 1, 1, 0, 2, 1, 1, 0, 2, 1, 1, 2, 0, 0)),
  .Names = c("XMin", "XMax", "YMin", "YMax", "Value"),
  class = c("data.table", "data.frame"), row.names = c(NA, -20L))
#Create sparse matrix to store the result:
xMax <- max(saveMe$XMax) - min(saveMe$XMin) + 1
yMax <- max(saveMe$YMax) - min(saveMe$YMin) + 1
img <- Matrix(0, nrow = xMax, ncol = yMax, sparse = TRUE)
for (kx in 1:nrow(saveMe)) {
  img[as.numeric(saveMe[kx, 1]):as.numeric(saveMe[kx, 2]),
      as.numeric(saveMe[kx, 3]):as.numeric(saveMe[kx, 4])] <- as.numeric(saveMe[kx, 5])
}
nnzero(img)
image(img)
This takes a really long time -- about five hours -- and is dumb, iterating row-wise. I know that typically one can use apply to speed things up hugely. So, I've tried to do this, much as you might expect:
img <- Matrix(0, nrow = xMax, ncol = yMax, sparse = TRUE)
apFun <- function(x, imToUse) {
  # idea is to then change that to something like...
  imToUse[x[1]:x[2], x[3]:x[4]] <- x[5]
}
apply(as.matrix(saveMe), 1, apFun, imToUse = img)
nnzero(img)
image(img)
However, whatever I try, the resulting elements in img are always zero. I think this might be a variable scoping issue. What am I doing wrong?
As an aside, the problem I really want to solve is to create an integer "sparse image" for this data, where everything is zero aside from the elements in the rectangle bounded by [XMin XMax YMin YMax], which are equal to Value (i.e. x[5]). Is there a better way of doing this?
Your suspicions are correct. Try this to convince yourself:
f <- function(x) {
  x <- 5
}
x <- 4
f(x)
# Nothing is printed (the assignment's value is returned invisibly)
x
# [1] 4
y <- f(x)
x
# [1] 4
y
# [1] 5
For your function, since you're not assigning the result of apply(), you want to add the object you updated at the end of the function as the return value:
apFun <- function(x, imToUse) {
  # idea is to then change that to something like...
  imToUse[x[1]:x[2], x[3]:x[4]] <- x[5]
  imToUse
}
This is similar to:
rm(x, y)
f <- function(x) {
  x <- 5
  x
}
x <- 4
f(x)
# [1] 5
x
# [1] 4
Notice that you are STILL not updating x. But you are returning a value.
EDIT:
On review of the purpose of your function and your call to apply(), I'd recommend you stick with your original for loop. The intent of your call to apply() is to update the values of an object in the parent environment. Since the benefit of apply() is the convenience of a wrapper around a loop plus the protection of a local environment, you have to go through a series of contortions to get out of that protected wrapper.
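If you do want a functional form despite that, one way out of the local-environment trap (a sketch only, and still a sequential loop under the hood, so don't expect it to beat the for loop) is to thread the image through Reduce(), so each step returns the updated copy and the final result is assigned back:

```r
# Sketch: instead of mutating img inside apply(), accumulate it through
# Reduce(). Each call returns the updated image, and the final value is
# assigned back to img in the calling environment.
img <- Reduce(function(im, k) {
  im[saveMe[[k, 1]]:saveMe[[k, 2]],
     saveMe[[k, 3]]:saveMe[[k, 4]]] <- saveMe[[k, 5]]
  im
}, seq_len(nrow(saveMe)), init = img)
```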
HOW TO SPEED IT UP: change your for loop to this:
for (i in seq_len(nrow(saveMe))) {
  img[saveMe[[i, 1]]:saveMe[[i, 2]], saveMe[[i, 3]]:saveMe[[i, 4]]] <- saveMe[[i, 5]]
}
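As for your aside about a better way: assuming you can afford to expand each rectangle into explicit (row, column) pairs, you can skip per-row assignment altogether and build the matrix in one call to Matrix::sparseMatrix(). This is a sketch, with one important caveat: sparseMatrix() sums duplicated (i, j) entries by default, so overlapping rectangles behave differently from the loop's last-write-wins assignment.

```r
library(Matrix)
# Sketch: expand each rectangle into (i, j, x) triplets, then build the
# sparse matrix in a single call. Caveat: sparseMatrix() SUMS duplicate
# (i, j) entries, unlike the loop, which overwrites.
m <- as.matrix(saveMe)
trip <- do.call(rbind, lapply(seq_len(nrow(m)), function(k) {
  expand.grid(i = m[k, 1]:m[k, 2], j = m[k, 3]:m[k, 4], x = m[k, 5])
}))
img2 <- sparseMatrix(i = trip$i, j = trip$j, x = trip$x,
                     dims = c(xMax, yMax))
```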
Where is the new loop saving you time? The big time savings here comes from using [[ rather than [ to extract a single value from a data table by integer index. Here's the situation: you're looking up 5 single values per row in a data table of 400,000 rows, using the row and column integer index (so that's 2,000,000 lookups in your loop), and assigning into an array based on those values 400,000 times. The assignment might be hard to optimize, but the lookup is not. Let's run 100 trials each of an integer-index lookup in a data table and an assignment of that single value, comparing the [ and [[ operators.
library(data.table)
DT <- data.table(x = sample(5000))
single <- replicate(100, {
  system.time({
    for (i in seq_len(nrow(DT))) {
      z <- DT[i, 1]
    }
  })
})
double <- replicate(100, {
  system.time({
    for (i in seq_len(nrow(DT))) {
      z <- DT[[i, 1]]
    }
  })
})
rowMeans(single)
# user.self sys.self elapsed user.child sys.child
# 1.69405 0.03519 1.89836 0.00000 0.00000
rowMeans(double)
# user.self sys.self elapsed user.child sys.child
# 0.05047 0.00083 0.05668 0.00000 0.00000
The key value here is user.self. You can see that using [[ to extract the value is about 30 times faster, based on 100 trials.
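If you need to squeeze out more, a further step worth benchmarking on your data (my suggestion, not something tested on the full 400k rows): since all five columns are numeric, convert the data table to a base matrix once before the loop. Single-element indexing on a plain matrix is cheaper than either [ or [[ on a data table, so the per-iteration dispatch cost disappears entirely:

```r
# One-time conversion; valid because all five columns are numeric.
# Inside the loop, plain-matrix indexing avoids data table dispatch.
m <- as.matrix(saveMe)
for (i in seq_len(nrow(m))) {
  img[m[i, 1]:m[i, 2], m[i, 3]:m[i, 4]] <- m[i, 5]
}
```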