Search code examples
rmaxsubsetsequence

What is the best way to find the biggest subset of numbers separated at least by N?


I am interessed in obtaining the group of maximum size whose elements are all spaced at least N units (N possibly decimal) between them, I would like to have something with the following input/output

1 2 3 4 5, spaced by 2
1 3 5 | 2 4

35 40 45 50 55 60 65 70 75, spaced by 10
35 45 55 65 75 | 40 50 60 70

37.5 39.5 40.5 57.7 62 76.3, spaced by 3.5
37.5 57.7 62 76.3 | 39.5 57.7 62 76.3

What I have tried is to use the following:

split(vector,vector%%spacing)

And it seemed to work, but I think that the modulus of a double is responsible of a pretty weird behaviour, where the same values introduced to a function give a different result wether if they come from the 34th line of a data.frame, or if they are passed directly to the function... I prepared this snippet so anyone could try to replicate the behaviour:

calculate_solution <- function(parA, parB, parC, parD) {
  varA <- parA/2
  varB <- seq(from=varA+parB,to=parA-parB,by=parB)

  varC <- 1 / parC

  varD <- split(varB,varB%%varC)
  
  print(varD)
}

df_1 <- list(
  a=seq(from=75,to=85,by=5),
  b=seq(from=1,to=2.5,by=0.5),
  c=seq(from=0.05,to=0.4,by=0.05),
  d=seq(from=2,to=2,by=1)) %>%
  expand.grid()

print(c(df_1[34,]$a,df_1[34,]$b,df_1[34,]$c,df_1[34,]$d))

#[1] 75.00  2.50  0.15  2.00

calculate_solution(df_1[34,]$a,df_1[34,]$b,df_1[34,]$c,df_1[34,]$d)

#$`3.5527136788005e-15`
#[1] 40
#
#$`5.32907051820075e-15`
#[1] 60
#
#$`0.833333333333337`
#[1] 47.5
#
#$`0.833333333333339`
#[1] 67.5
#
#$`1.66666666666667`
#[1] 55
#
#$`2.5`
#[1] 42.5
#
#$`2.50000000000001`
#[1] 62.5
#
#$`3.33333333333334`
#[1] 50 70
#
#$`4.16666666666667`
#[1] 57.5
#
#$`5`
#[1] 45
#
#$`5.00000000000001`
#[1] 65
#
#$`5.83333333333334`
#[1] 52.5 72.5
#

df_2 <- data.frame(a=75.0,b=2.5,c=0.15,d=2.0)

calculate_solution(df_2[1,]$a,df_2[1,]$b,df_2[1,]$c,df_2[1,]$d)

#$`0.83333333333333`
#[1] 67.5
#
#$`0.833333333333331`
#[1] 47.5
#
#$`1.66666666666666`
#[1] 55
#
#$`2.5`
#[1] 42.5 62.5
#
#$`3.33333333333333`
#[1] 50 70
#
#$`4.16666666666666`
#[1] 57.5
#
#$`5`
#[1] 45 65
#
#$`5.83333333333333`
#[1] 52.5 72.5
#
#$`6.66666666666666`
#[1] 60
#
#$`6.66666666666667`
#[1] 40

Instead of trying to find the reason behind this behaviour, I thought that maybe I can find another way of achieving what I want, or at least maybe discover the right terminology to refer to what I am trying to do.


Solution

  • This function starts by finding the first vector where each element is at least diff away from the previous element. It then repeats this process starting with the second and subsequent elements, stopping when either a subset will be formed (the nth element is found in one of the previous solutions) or there are no candidate sub-vectors left.

    func <- function(y, diff = 1, only_longest = TRUE) {
      fun0 <- function(x, diff) {
        if (length(x) < 2) return(x)
        lenx <- length(x)
        x <- sort(x) # just in case
        i <- 1
        leni <- 1L
        while ( (x[lenx] - i[leni]) >= diff ) {
          i1 <- which( x[-seq_len(i[leni])] - x[i[leni]] >= diff )
          if (length(i1)) i <- c(i, i[leni] + i1[1]) else break
          leni <- length(i)
        }
        x[i]
      }
      leny <- length(y)
      i <- 1L
      out <- list()
      while (!y[i] %in% unlist(out) && length(cand <- fun0(y[seq(i, leny)], diff)) > 0) {
        out[[i]] <- cand
        i <- i + 1L
      }
      lens <- lengths(out)
      if (only_longest) out <- out[lens == max(lens)]
      out
    }
    

    Demonstration:

    func(1:5, 2)
    # [[1]]
    # [1] 1 3 5
    func(seq(35,75,5), 10)
    # [[1]]
    # [1] 35 45 55 65 75
    func(c(37.5, 39.5, 40.5, 57.7, 62, 76.3), 3.5)
    # [[1]]
    # [1] 37.5 57.7 62.0 76.3
    # [[2]]
    # [1] 39.5 57.7 62.0 76.3
    # [[3]]
    # [1] 40.5 57.7 62.0 76.3
    

    The only_longest= argument was a thought at one point. The presumption is that if any of the vectors are shorter than the longest, then you might or might not be interested in knowing it. Just a thought, easily removed.

    func(1:5, 2, only_longest = FALSE)
    # [[1]]
    # [1] 1 3 5
    # [[2]]
    # [1] 2 4
    
    func(seq(35,75,5), 10, only_longest = FALSE)
    # [[1]]
    # [1] 35 45 55 65 75
    # [[2]]
    # [1] 40 50 60 70