I'd like to create the dataframe output
from dataframe input
in R by running multiple scenarios given two variables x
and y
. The column output
is the sum of all the values in the column value
where there xcol < x & ycol < y
.
input =
xcol ycol value
1 5 4
2 6 9
3 7 8
4 9 7
5 14 8
And
output=
x y results
2 5 0
2 10 4
2 15 35
...
6 5 0
6 10 27
6 15 35
My code is currently like this:
for (x in 2:6) {
if (x%% 2){
next
}
for (y in 5:15) {
if (y %% 5){
next
}
print(x)
print(y)
print(sum(input$value[!is.na(input$xcol) & !is.na(input$ycol) & !is.na(input$value) &
input$xcol < x & input$ycol < y]))
}
}
There should be a better way to replace this nested loop using lapply & sapply and create a dataframe I would think. I'd appreciate any helps.
Thanks
This seems more like a design-of-experiments, in a sense, where you are iterating over different possible values of x
and y
.
xs <- 2:6
ys <- 5:15
eg <- expand.grid(x = xs, y = ys)
head(eg)
# x y
# 1 2 5
# 2 3 5
# 3 4 5
# 4 5 5
# 5 6 5
# 6 2 6
I think your %%
filtering should be done outside/before this, so:
xs <- xs[!xs %% 2]
ys <- ys[!ys %% 5]
eg <- expand.grid(x = xs, y = ys)
head(eg)
# x y
# 1 2 5
# 2 4 5
# 3 6 5
# 4 2 10
# 5 4 10
# 6 6 10
From here, you can just iterate over the rows:
eg$out <- sapply(seq_len(nrow(eg)), function(r) {
sum(input$value[ complete.cases(input) & input$xcol < eg$x[r] & input$ycol < eg$y[r] ])
})
eg
# x y out
# 1 2 5 0
# 2 4 5 0
# 3 6 5 0
# 4 2 10 4
# 5 4 10 21
# 6 6 10 28
# 7 2 15 4
# 8 4 15 21
# 9 6 15 36
I think your output
variable is a little off, since "2,15" should only include input$value[1]
(x < 2
is the limiting factor). (Other differences exist.)
Regardless of your actual indexing logic, I suggest this methodology over a double-for
or double-lapply
implementation.
NB:
These commands are functionally equivalent with this input
:
complete.cases(input) # 1
complete.cases(input[c("xcol","ycol","value")]) # 2
!is.na(input$xcol) & !is.na(input$xcol) & !is.na(input$value) # 3
I used the first since "code golf", but if your actual input
data.frame contains other columns, you may prefer the second to be more selective of which columns require non-NA
values.
expand.grid
works great for this type of expansion. However, if you are looking at significantly larger datasets (including if your filtering is more complex than %%
offers), than it can be a little expensive as it must create the entire data.frame
in memory. Python's use of lazy iterators would be useful here, in which case you might prefer to use https://stackoverflow.com/a/36144255/3358272 (expanded function in a github gist with some docs: https://gist.github.com/r2evans/e5531cbab8cf421d14ed).