Here is a sampling of my data, which is in a matrix:
BLOCK RUNTIME
101 50
101 20
101 -25
101 -40
101 35
101 45
202 25
202 -10
202 -35
202 40
202 50
202 30
202 -20
202 15
.
.
.
n
The desired output of my yet-to-be code is this:
BLOCK TIME_CHUNKS
101 70
101 -65
101 80
202 25
202 -45
202 120
202 -20
202 15
What I want the code to do to get the output: for as long as subsequent rows' BLOCK values are equal to the current row's BLOCK value, AND for as long as subsequent rows' RUNTIME values are the same sign, sum them, and populate a new row in a table with the BLOCK value and the summed value. Then proceed where you left off. In the example data, the first two rows (50, 20) are summed and written to a table. Then the sign of RUNTIME switches and is constant for two rows (-25, -40), so these values are summed. Then the sign of RUNTIME switches again and is positive for three rows (35, 45, 25), but the last row of these three has a different BLOCK number, so only the first two of these three rows are summed and written to the table. Proceed until the end of the matrix is reached.
It's easy enough for me to make the conditional statements, but I don't know how to "keep track" of where I am in the matrix to restart the summing process. I'm not that well-versed in for or while loops to even make an educated guess at how they could be used here. I made a pathetic attempt at writing a function, but didn't get anywhere.
Need to automate this process...I've got about 10,000 rows of data, which can be dynamically produced based on a set of scenario variables. Each set of variables will produce a table with different values for RUNTIME, and I've got a lot of scenarios to run.
Much appreciation for any help.
TIME_CHUNKS <- with(df, tapply(RUNTIME, BLOCK, function(x)
tapply(x, cumsum(c(1, diff(sign(x)) != 0)), sum)))
out <- data.frame(BLOCK = rep.int(unique(df$BLOCK), sapply(TIME_CHUNKS, length)),
TIME_CHUNKS = unlist(TIME_CHUNKS), row.names = NULL)
Use this data and paste the above code
df <- data.frame(BLOCK = c(101, 101, 101, 101, 101, 101, 202, 202,
202, 202, 202, 202, 202, 202), RUNTIME = c(50, 20, -25, -40,
35, 45, 25, -10, -35, 40, 50, 30, -20, 15))
and you will get out
to be:
BLOCK TIME_CHUNKS
101 70
101 -65
101 80
202 25
202 -45
202 120
202 -20
202 15
Notice the nested use of tapply
. In words, what the above code is doing to calculate TIME_CHUNKS
is:
Split up each row by BLOCK
, and let x
be the vectors for a given block (for example,
x
will at first be c(50, 20, -25, -40, 35, 45)
.
The quizzical looking cumsum(c(1, diff(sign(x)) != 0))
simply sub-divides our block
into consecutive groups of same-signed numbers. Namely, diff(sign(x)) != 0
gives a vector of
TRUE
s and FALSE
s according as the sign switches or not, and coercing to integer in
combination with cumsum
yields a vector that gives a different number for each sub-sequence
with the same sign. Applying a sum along each subsequence gives the result we're looking for.