Search code examples
rperformancealgorithmstable-sort

Sorting with tie-breaking that minimizes disconinuity of a boolean field


Let D be a data.frame, with D$x containing real numbers and D$y containing booleans, among other fields.

The problem is to sort the rows of D so that D$x is non-decreasing, while breaking ties in a way that minimizes the number of discontinuities in the resulting D$y.

Is there a simple fast way to accomplish this in R?

More Information

In a language like C I would first sort by x, then pass over the result sequentially with a 2-state FSM to iron out the discontinuities as far as possible. But in R, I expect iteration to carry unnecessary overhead if there are thousands of rows to process sequentially.

Example correct result:

D$x  D$y
1    FALSE
1    FALSE
1    TRUE
1    TRUE
1.2  TRUE
1.5  TRUE
1.5  FALSE

Example incorrect result:

D$x  D$y
1    TRUE
1    FALSE
1    TRUE
1    FALSE
1.2  TRUE
1.5  FALSE
1.5  TRUE

In the example, the correct result has 2 discontinuities while the incorrect result has 6.

EDIT: We can assume the data is such that the density of discontinuities in the result will be low: Less than 1 discontinuity per 1000 rows, say.


Solution

  • Much faster solution than sequential iteration (if discontinuities are sparse):

    sortForMaxContY <- function(D,initialY){
        n <- nrow(D)
        D <- D[order(D$x),]
    
        xChanges <- D$x[-1]!=D$x[-n]
        isLastOfXVal <- c(xChanges,T)
        rankOfXVal <- cumsum(c(T,xChanges))
    
        oldFinalYs <- NA
        finalYs <- D$y[isLastOfXVal]
        while(!identical(finalYs,oldFinalYs)){
            finalYOfPrecedingXVal <- c(initialY,finalYs)[rankOfXVal]
            oldFinalYs <- finalYs
            D <- D[order(D$x,xor(finalYOfPrecedingXVal,D$y)),]
            finalYs <- D$y[isLastOfXVal]
        }
        return(D)
    }
    

    One of sortForMaxContY(D,T) and sortForMaxContY(D,F) is the optimum, and the other usually is too, depending on the data.