
In Scala, how to read bytes from binary file delimited by characters?


In Scala, given a binary file, I am interested in retrieving a list of Array[Byte] items.

For example, the binary file has items delimited by the characters/bytes 'my-delimiter'.

How can I get a list of Array[Byte] for each item?


Solution

  • A functional solution using java.nio, shown here with a single-byte delimiter ('\n'):

    import java.nio.file.{Files, Paths}
    
    object Main {
    
      private val delimiter = '\n'.toByte
    
      def main(args: Array[String]): Unit = {
        val byteArray = Files.readAllBytes(Paths.get(args(0)))
    
        case class Accumulator(result: List[List[Byte]], current: List[Byte])
    
        val items: List[Array[Byte]] = byteArray.foldLeft(Accumulator(Nil, Nil)) {
          case (Accumulator(result, current), nextByte) =>
            if (nextByte == delimiter)
              Accumulator(current :: result, Nil)
            else
              Accumulator(result, nextByte :: current)
        } match {
          case Accumulator(result, current) => (current :: result).reverse.map(_.reverse.toArray)
        }
        items.foreach(item => println(new String(item)))
      }
    
    }
    

    This solution is expected to perform poorly, though: it reads the whole file into memory and builds immutable lists of boxed bytes one element at a time. How important is that for you? How many files, of what size, and how often will you read them? If performance matters, then you should rather use input streams and mutable collections:

    import java.io.{BufferedInputStream, FileInputStream}
    
    import scala.collection.mutable.ArrayBuffer
    
    object Main {
    
      private val delimiter = '\n'.toByte
    
      def main(args: Array[String]): Unit = {
        val items = ArrayBuffer.empty[Array[Byte]]
        val item = ArrayBuffer.empty[Byte]
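        val bis = new BufferedInputStream(new FileInputStream(args(0)))
        try {
          var nextByte: Int = -1
          // read() returns the next byte as an Int in 0..255, or -1 at end of stream
          while ( { nextByte = bis.read(); nextByte } != -1) {
            // compare as Byte so that a delimiter >= 0x80 would also match
            if (nextByte.toByte == delimiter) {
              items.append(item.toArray)
              item.clear()
            } else {
              item.append(nextByte.toByte)
            }
          }
          // the last item has no trailing delimiter
          items.append(item.toArray)
        } finally {
          bis.close() // close the stream even if reading fails
        }
        items.foreach(item => println(new String(item)))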
      }
    
    }
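
The question asks about a multi-byte delimiter ('my-delimiter'), while both snippets above split on a single byte. Here is a minimal sketch of how the same idea might extend to a byte-sequence delimiter, assuming the whole file fits in memory. The names SplitByDelimiter, split and matchesAt are made up for this illustration, and the naive scan is not tuned for performance:

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper, not part of the original answer.
object SplitByDelimiter {

  // True if `delimiter` occurs in `data` starting exactly at index `pos`.
  private def matchesAt(data: Array[Byte], delimiter: Array[Byte], pos: Int): Boolean =
    if (pos + delimiter.length > data.length) false
    else {
      var i = 0
      while (i < delimiter.length && data(pos + i) == delimiter(i)) i += 1
      i == delimiter.length
    }

  // Splits `data` at every occurrence of `delimiter`; the delimiter bytes are dropped.
  // As in the snippets above, a trailing delimiter yields a final empty item.
  def split(data: Array[Byte], delimiter: Array[Byte]): List[Array[Byte]] = {
    require(delimiter.nonEmpty, "delimiter must not be empty")
    val items = ArrayBuffer.empty[Array[Byte]]
    var start = 0 // index where the current item begins
    var pos = 0   // index currently being examined
    while (pos < data.length) {
      if (matchesAt(data, delimiter, pos)) {
        items += data.slice(start, pos)
        pos += delimiter.length
        start = pos
      } else {
        pos += 1
      }
    }
    items += data.slice(start, data.length)
    items.toList
  }
}
```

For the file in the question, this could be called as `SplitByDelimiter.split(Files.readAllBytes(Paths.get(path)), "my-delimiter".getBytes(StandardCharsets.US_ASCII))`, where `path` is a placeholder. For very large files, a streaming variant that keeps only a small look-behind buffer would be the better fit.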