I have written a for loop in my code:
for (i in 2:nrow(ProductionWellYear2)) {
  if (ProductionWellYear2[i, ncol(ProductionWellYear2)] == 0) {
    ProductionWellYear2[i, ncol(ProductionWellYear2)] = ProductionWellYear2[i - 1, ncol(ProductionWellYear2)] + 1
  } else {
    ProductionWellYear2[i, ncol(ProductionWellYear2)] = ProductionWellYear2[i, ncol(ProductionWellYear2)]
  }
}
However, this is very slow because the data frame has over 800k rows. How can I make it faster and avoid the for loop?
This should work for you, but without seeing your data I can't verify that the results are what you want. The process is not much different from what you originally wrote, but benchmarking on my example data shows it is faster, though not necessarily "fast".
library(microbenchmark)
# Create fake data
set.seed(1)
ProductionWellYear <- data.frame(A = as.integer(rnorm(2500)),
                                 B = as.integer(rnorm(2500)),
                                 C = as.integer(rnorm(2500)))
# Copy it to confirm results of both processes are the same
ProductionWellYear2 <- ProductionWellYear
# Slightly modified original version
method1 <- function() {
  cols <- ncol(ProductionWellYear)
  # Note: assigning inside the function modifies a local copy,
  # so the global ProductionWellYear is left unchanged
  for (i in 2:nrow(ProductionWellYear)) {
    if (ProductionWellYear[i, cols] == 0) {
      ProductionWellYear[i, cols] = ProductionWellYear[i - 1, cols] + 1
    } else {
      ProductionWellYear[i, cols] = ProductionWellYear[i, cols]
    }
  }
}
# New version
method2 <- function() {
  cols <- ncol(ProductionWellYear2)
  # `<<-` writes to the global ProductionWellYear2
  sapply(2:nrow(ProductionWellYear2), function(i) {
    if (ProductionWellYear2[i, cols] == 0) {
      ProductionWellYear2[i, cols] <<- ProductionWellYear2[i - 1, cols] + 1
    }
  })
}
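# Comparing the data frames (as written here, this runs before either
# function has been called, so it compares the two unmodified copies)
all(ProductionWellYear == ProductionWellYear2)
#[1] TRUE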
result <- microbenchmark(method1(), method2())
result
#Unit: milliseconds
#      expr       min       lq      mean   median        uq      max neval
# method1() 151.78802 167.3932 190.14905 176.2855 197.60406 337.9904   100
# method2()  45.56065  53.7744  67.55549  59.9299  72.81873 174.1417   100
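Both versions above still visit the rows one at a time. If the goal is to drop the loop entirely, one option is a vectorised rewrite, sketched below. It relies on the observation that each zero ends up as the value of the most recent non-zero row (or of row 1, which the loop never touches) plus its distance from that row. The helper names `fill_zero_runs` and `loop_fill` and the small data frame `df` are made up for illustration, and the sketch assumes the column is numeric with no NA values. I have not timed it on 800k rows, so treat it as something to benchmark rather than a guaranteed speed-up.

```r
# Hypothetical helper (not from the original post): fills zeros in a numeric
# vector the same way the sequential loop does.
fill_zero_runs <- function(x) {
  idx <- seq_along(x)
  # Non-zero rows are anchors; row 1 is never modified by the loop,
  # so it always counts as an anchor too.
  is_anchor <- x != 0
  is_anchor[1] <- TRUE
  # For every position, the index of the most recent anchor at or before it.
  last_anchor <- cummax(ifelse(is_anchor, idx, 0L))
  # Anchor value plus the distance from that anchor.
  x[last_anchor] + (idx - last_anchor)
}

# Loop version on a plain vector, used only to check the sketch.
loop_fill <- function(x) {
  for (i in 2:length(x)) {
    if (x[i] == 0) x[i] <- x[i - 1] + 1
  }
  x
}

# Small hand-checked case, including a -1 followed by zeros.
small <- c(0, 0, 3, 0, 0, -1, 0, 0)
stopifnot(all(fill_zero_runs(small) == c(0, 1, 3, 4, 5, -1, 0, 1)))

# Same kind of fake data as in the benchmark above.
set.seed(1)
x <- as.integer(rnorm(2500))
stopifnot(all(fill_zero_runs(x) == loop_fill(x)))

# Applying it to the last column of a data frame (df is illustrative).
df <- data.frame(A = 1:8, B = small)
last_col <- ncol(df)
df[[last_col]] <- fill_zero_runs(df[[last_col]])
```

Pulling the column out with `[[` and working on a plain vector also avoids the repeated data frame indexing, which is likely a large part of the cost in the loop versions.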