Tags: r, optimization, mathematical-optimization

Split a vector into chunks such that sum of each chunk is approximately constant


I have a large data frame with more than 100,000 records, where the values are sorted.

For example, consider the following dummy data set

df <- data.frame(values = c(1,1,2,2,3,4,5,6,6,7))

I want to create 3 groups of the above values (keeping the original sequence) such that the sum of each group is more or less the same.

So for the above data, if I decide to divide the sorted df into 3 groups as follows, their sums will be

1. 1 + 1 + 2 + 2 + 3 + 4 = 13
2. 5 + 6 = 11
3. 6 + 7 = 13
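
These sums are easy to verify in base R, for example:

sapply(split(df$values, rep(1:3, c(6, 2, 2))), sum)
#>  1  2  3 
#> 13 11 13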

How can I implement this optimization in R? Any ideas on the logic?


Solution

  • So, let's use pruning. I think the other answers give a good solution, but not the best one.

    First, we want to minimize

    D(p, q) = (S_p - S*)^2 + (S_q - S_p - S*)^2 + (S_n - S_q - S*)^2

    where S_k is the cumulative sum of the first k elements, p < q are the two cut points, and S* = S_n / 3 is the ideal chunk sum.

    # D(p, q) for cut points p (possibly a vector) and a single q,
    # given the vector of cumulative sums S
    computeD <- function(p, q, S) {
      n <- length(S)
      S.star <- S[n] / 3  # ideal chunk sum
      if (all(p < q)) {
        (S[p] - S.star)^2 + (S[q] - S[p] - S.star)^2 + (S[n] - S[q] - S.star)^2
      } else {
        stop("You shouldn't be here!")
      }
    }
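
    For example, on the dummy data above, the cuts p = 6 and q = 8 reproduce the 13 / 11 / 13 split; a quick check of computeD:

    S <- cumsum(c(1, 1, 2, 2, 3, 4, 5, 6, 6, 7))
    computeD(6, 8, S)
    #> [1] 2.666667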
    

    I think the other solutions optimize over p and q independently, which won't give the global minimum (this is expected to fail for some particular cases).

    optiCut <- function(v) {
      S <- cumsum(v)
      n <- length(v)
      S_star <- S[n] / 3
      # good starting values: greedy cuts closest to S* and 2 * S*
      p_star <- which.min((S - S_star)^2)
      q_star <- which.min((S - 2 * S_star)^2)
      print(min <- computeD(p_star, q_star, S))
      
      count <- 0
      for (q in 2:(n - 1)) {
        S3 <- S[n] - S[q] - S_star
        # explore this q only if the third term alone is below the current minimum
        if (S3 * S3 < min) {
          count <- count + 1
          D <- computeD(seq_len(q - 1), q, S)
          ind <- which.min(D)
          if (D[ind] < min) {
            # update optimal values
            p_star <- ind
            q_star <- q
            min <- D[ind]
          }
        }
      }
      c(p_star, q_star, computeD(p_star, q_star, S), count)
    }
    

    This is as fast as the other solutions because it prunes many of the iterations via the condition S3 * S3 < min. But, unlike them, it gives the optimal solution; see optiCut(c(1, 2, 3, 3, 5, 10)).
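
    For this vector, the greedy starting cuts give chunks (9, 5, 10) with D = 14, while the pruned search finds p = 3, q = 5, i.e. chunks (6, 8, 10) with D = 8. As I read the code, the output should be (first the printed D of the starting values, then p*, q*, the optimal D, and the pruning counter):

    optiCut(c(1, 2, 3, 3, 5, 10))
    #> [1] 14
    #> [1] 3 5 8 1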


    For the solution with a general K >= 3, I basically reimplemented trees with nested tibbles; that was fun!

    # requires the tibble, purrr and dplyr packages
    optiCut_K <- function(v, K) {
      
      S <- cumsum(v)
      n <- length(v)
      S_star <- S[n] / K
      # good starting values: greedy cuts closest to i * S*
      p_vec_first <- sapply(seq_len(K - 1), function(i) which.min((S - i * S_star)^2))
      min_first <- sum((diff(c(0, S[c(p_vec_first, n)])) - S_star)^2)
      
      compute_children <- function(level, ind, val) {
        
        # leaf: add the cost of the first chunk and keep only if it beats the greedy bound
        if (level == 1) {
          val <- val + (S[ind] - S_star)^2
          if (val > min_first) {
            return(NULL)
          } else {
            return(val)
          } 
        } 
        
        # partial costs for all possible positions of the previous cut
        P_all <- val + (S[ind] - S[seq_len(ind - 1)] - S_star)^2
        inds <- which(P_all < min_first)
        if (length(inds) == 0) return(NULL)
        
        node <- tibble::tibble(
          level = level - 1,
          ind = inds,
          val = P_all[inds]
        )
        node$children <- purrr::pmap(node, compute_children)
        
        # drop branches that were entirely pruned
        node <- dplyr::filter(node, !purrr::map_lgl(children, is.null))
        `if`(nrow(node) == 0, NULL, node)
      }
      
      compute_children(K, n, 0)
    }
    

    This gives you all the solutions that are at least as good as the greedy one:

    v <- sort(sample(1:1000, 1e5, replace = TRUE))
    test <- optiCut_K(v, 9)
    

    You need to unnest this:

    full_unnest <- function(tbl) {
      # recursively unnest until there are no list-columns left (unnest() errors)
      tmp <- try(tidyr::unnest(tbl), silent = TRUE)
      `if`(inherits(tmp, "try-error"), tbl, full_unnest(tmp))
    }
    print(test <- full_unnest(test))
    

    And finally, to get the best solution (the row whose leaf value in children, i.e. the final objective D, is smallest):

    test[which.min(test$children), ]
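
    Putting the pieces together, a small convenience wrapper (hypothetical; it just composes the functions defined above) could look like:

    # hypothetical wrapper: optimal cuts of v into K chunks
    best_cut_K <- function(v, K) {
      res <- full_unnest(optiCut_K(v, K))
      res[which.min(res$children), ]
    }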