Let D
be a data.frame, with D$x
containing real numbers and D$y
containing booleans, among other fields.
The problem is to sort the rows of D
so that D$x
is non-decreasing, while breaking ties in a way that minimizes the number of discontinuities in the resulting D$y
.
Is there a simple fast way to accomplish this in R?
In a language like C I would first sort by x, then pass over the result sequentially with a 2-state FSM to iron out the discontinuities as far as possible. But in R, I expect iteration to carry unnecessary overhead if there are thousands of rows to process sequentially.
Example correct result:
D$x D$y
1 FALSE
1 FALSE
1 TRUE
1 TRUE
1.2 TRUE
1.5 TRUE
1.5 FALSE
Example incorrect result:
D$x D$y
1 TRUE
1 FALSE
1 TRUE
1 FALSE
1.2 TRUE
1.5 FALSE
1.5 TRUE
In the example, the correct result has 2 discontinuities while the incorrect result has 6.
EDIT: We can assume the data is such that the density of discontinuities in the result will be low: Less than 1 discontinuity per 1000 rows, say.
Much faster solution than sequential iteration (if discontinuities are sparse):
sortForMaxContY <- function(D,initialY){
n <- nrow(D)
D <- D[order(D$x),]
xChanges <- D$x[-1]!=D$x[-n]
isLastOfXVal <- c(xChanges,T)
rankOfXVal <- cumsum(c(T,xChanges))
oldFinalYs <- NA
finalYs <- D$y[isLastOfXVal]
while(!identical(finalYs,oldFinalYs)){
finalYOfPrecedingXVal <- c(initialY,finalYs)[rankOfXVal]
oldFinalYs <- finalYs
D <- D[order(D$x,xor(finalYOfPrecedingXVal,D$y)),]
finalYs <- D$y[isLastOfXVal]
}
return(D)
}
One of sortForMaxContY(D,T)
and sortForMaxContY(D,F)
is the optimum, and the other usually is too, depending on the data.