Tags: scala, for-loop, if-statement, while-loop, arraybuffer

Combine multiple sequential entries in Scala/Spark


I have a comma-separated list of numbers, as shown:

a:{108,109,110,112,114,115,116,118}

I need output like this:

a:{108-110, 112, 114-116, 118}

I am trying to group consecutive numbers into ranges joined by "-". For example, 108,109,110 are consecutive, so I get 108-110. 112 is a separate entry; 114,115,116 form another run, so I get 114-116. 118 is again on its own and is treated as such.

I am doing this in Spark. I wrote the following code:

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.functions.udf

def Sample(x:String):ArrayBuffer[String]={
  val x1 = x.split(",")
  var a:Int = 0
  var present=""
  var next:Int = 0
  var yrTemp = ""
  var yrAr= ArrayBuffer[String]()
  var che:Int = 0
  var storeV = ""
  var p:Int = 0 
  var q:Int = 0

  var count:Int = 1

  while(a < x1.length)
  {
      yrTemp = x1(a)

      if(x1.length == 1)
      {
          yrAr+=x1(a)
      }
      else
      if(a < x1.length - 1)
       {
           present = x1(a)
          if(che == 0)
          {
                storeV = present
          }

          p = x1(a).toInt
          q = x1(a+1).toInt

          if(p == q)
          {
              yrTemp = yrTemp
              che = 1
          }
          else
          if(p != q)
             {
                 yrTemp = storeV + "-" + present 
                 che = 0
                 yrAr+=yrTemp
             }

       }
       else
            if(a == x1.length-1)
            {
                present = x1(a)
                yrTemp = present 
                che = 0
                yrAr+=yrTemp
            }
      a = a+1
  }
yrAr
}
val SampleUDF = udf(Sample(_:String))

I am getting the output as follows:

a:{108-108, 109-109, 110-110, 112, 114-114, 115-115, 116-116, 118}

I can't figure out where I am going wrong. Can you please help me correct this? TIA.
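For what it's worth, the repeated `108-108` style entries seem to come from the test `p == q`. The input numbers are distinct, so that test is never true, `che` stays 0, and `storeV` is reset to `present` on every pass. Testing for adjacency (`p + 1 == q`) looks like what was intended. Even then, the final branch appends only the last number, so a run that reaches the end of the input would not be closed. Below is a minimal sketch of a corrected loop that keeps the question's signature. The name `sampleFixed` is made up for illustration, and the code assumes a non-empty, ascending, comma-separated list of integers.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical corrected variant of the question's Sample.
// Assumption: x is a non-empty, ascending, comma-separated list of integers.
def sampleFixed(x: String): ArrayBuffer[String] = {
  val nums = x.split(",").map(_.trim.toInt)
  val out = ArrayBuffer[String]()
  var start = 0 // index where the current run began

  for (i <- nums.indices) {
    val isLast = i == nums.length - 1
    // The run ends at i if there is no next element,
    // or if the next element is not exactly nums(i) + 1.
    if (isLast || nums(i + 1) != nums(i) + 1) {
      out += (if (start == i) s"${nums(i)}" else s"${nums(start)}-${nums(i)}")
      start = i + 1 // the next run, if any, begins right after this one
    }
  }
  out
}

println(sampleFixed("108,109,110,112,114,115,116,118"))
```

The single `if` covers both the "gap ahead" case and the "end of input" case, so a run that ends on the last element is closed like any other.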


Solution

  • Here's another way:

    def rangeToString(a: Int, b: Int) = if (a == b) s"$a" else s"$a-$b"
    
    def reduce(xs: Seq[Int], min: Int, max: Int, ranges: Seq[String]): Seq[String] = xs match {
        case y +: ys if (y - max <= 1) => reduce(ys, min, y, ranges)
        case y +: ys                   => reduce(ys, y, y, ranges :+ rangeToString(min, max))
        case Seq()                     => ranges :+ rangeToString(min, max)
    }
    
    // Assumes xs is non-empty and sorted in ascending order; xs.head throws on an empty array.
    def output(xs: Array[Int]) = reduce(xs, xs.head, xs.head, Vector())//.toArray
    

    Which you can test:

    println(output(Array(108,109,110,112,114,115,116,118)))
      // Vector(108-110, 112, 114-116, 118)
    

    This is a tail-recursive function: the "variables" of a loop become parameters, and the function calls itself with updated values on each iteration. Here xs is the input that is still to be processed, min and max track the lowest and highest numbers of the range currently being built, and ranges is the output sequence of Strings, which is appended to whenever a range is completed.

    The first pattern binds y to the first element and ys to the rest of the sequence (that's how the +: extractor works). It matches if there is at least one element (ys can be empty) and y follows on from the previous maximum, i.e. y is at most 1 greater than max.

    The second pattern matches when y doesn't follow on. In that case the completed range is added to the output and min and max are both reset to y.

    The third case is where we've reached the end of the input: the final range is appended and the result is returned, rather than calling the function again.
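As a side note on the tail-recursion point, Scala's `@tailrec` annotation can ask the compiler to verify that the recursive call really is in tail position. The sketch below nests the loop inside an outer function so the annotation is accepted, and it also guards against empty input. The names `outputChecked` and `loop` are made up for illustration, and the input is assumed to be sorted ascending.

```scala
import scala.annotation.tailrec

def rangeToString(a: Int, b: Int): String = if (a == b) s"$a" else s"$a-$b"

// Hypothetical variant of `output`; assumes xs is sorted in ascending order.
def outputChecked(xs: Seq[Int]): Seq[String] = {
  // @tailrec makes compilation fail if `loop` is not tail recursive.
  @tailrec
  def loop(rest: Seq[Int], min: Int, max: Int, ranges: Vector[String]): Vector[String] =
    rest match {
      // y extends the current range
      case y +: ys if y - max <= 1 => loop(ys, min, y, ranges)
      // y starts a new range; emit the one just completed
      case y +: ys                 => loop(ys, y, y, ranges :+ rangeToString(min, max))
      // end of input; emit the last open range
      case _                       => ranges :+ rangeToString(min, max)
    }

  xs.headOption match {
    case Some(h) => loop(xs.tail, h, h, Vector.empty)
    case None    => Vector.empty // nothing to group
  }
}

println(outputChecked(Seq(108, 109, 110, 112, 114, 115, 116, 118)))
```

The behaviour on non-empty input should match the version above. The only intended differences are the compile-time check and returning an empty result instead of throwing on empty input.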

    Internet karma points to anyone who can work out how to eliminate the duplication of ranges :+ rangeToString(min, max)!
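One possible take on that challenge, offered as a sketch rather than the definitive answer: first collapse the input into (min, max) pairs, then format every pair in one place, so the formatting call appears only once. The names `runs` and `outputFolded` are invented here, and the input is again assumed to be sorted ascending.

```scala
// Sketch only: `runs` and `outputFolded` are hypothetical names.
// Assumes the input is sorted in ascending order.

// Collapse the input into (min, max) pairs, one per run of consecutive numbers.
def runs(xs: Seq[Int]): Vector[(Int, Int)] =
  xs.foldLeft(Vector.empty[(Int, Int)]) {
    // x continues the most recent run: widen that run's max.
    case (done :+ last, x) if x - last._2 <= 1 => done :+ ((last._1, x))
    // Otherwise (including the very first element) start a new run.
    case (acc, x) => acc :+ ((x, x))
  }

// Format every run in a single place.
def outputFolded(xs: Seq[Int]): Vector[String] =
  runs(xs).map { case (min, max) => if (min == max) s"$min" else s"$min-$max" }

println(outputFolded(Seq(108, 109, 110, 112, 114, 115, 116, 118)))
```

Because the fold starts from an empty Vector, empty input simply yields an empty result, with no need to call head on the input.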