scala scalaz-stream

How to merge adjacent lines with scalaz-stream without losing the splitting line


Suppose that my input file myInput.txt looks as follows:

~~~ text1
bla bla
some more text
~~~ text2
lorem ipsum
~~~ othertext
the wikipedia
entry is not
up to date

That is, each document starts with a header line of the form "~~~ title", followed by the lines of its body. The desired output is one line per document:

text1: bla bla some more text
text2: lorem ipsum
othertext: the wikipedia entry is not up to date

How do I go about that? The following seems pretty unnatural, plus I lose the titles:

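  import scalaz.concurrent.Task
  import scalaz.stream._

  val converter: Task[Unit] =
    io.linesR("myInput.txt")
      // split discards the lines matching the predicate, so the titles are lost here
      .split(line => line.startsWith("~~~"))
      .intersperse(Vector("\nNew document: "))
      .map(vec => vec.mkString(" "))
      .pipe(text.utf8Encode)
      .to(io.fileChunkW("flawedOutput.txt"))
      .run

  converter.run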

Solution

  • The following works, but it is extremely slow on anything larger than a toy example (about 5 minutes to process 70 MB). Is that because I am creating a Process for every input line? It also seems to use only a single core.

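      import scalaz.concurrent.Task
      import scalaz.stream._

      val converter2: Task[Unit] = {
        val docSep = "~~~"
        io.linesR("myInput.txt")
          // Turn each header line "~~~ title" into two elements: the separator and the title.
          .flatMap { line =>
            val words = line.split(" ")
            if (words.length == 0 || words(0) != docSep) Process(line)
            else Process(docSep, words.tail.mkString(" "))
          }
          // Split on the bare separator; each chunk is now the title followed by the body lines.
          .split(_ == docSep)
          .filter(_.nonEmpty) // drop the empty chunk before the first separator
          .map(lines => lines.head + ": " + lines.tail.mkString(" "))
          .intersperse("\n")
          .pipe(text.utf8Encode)
          .to(io.fileChunkW("correctButSlowOutput.txt"))
          .run
      }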
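
For comparison, here is a minimal sketch of the same grouping done in a single pass with only the Scala standard library. It is not a scalaz-stream solution and makes no claim about why the version above is slow. The names MergeDocuments, DocSep and mergeDocuments are made up for this illustration. It assumes that any line starting with "~~~" is a header, and that lines before the first header can be dropped.

```scala
object MergeDocuments {
  // Marker that starts every header line (taken from the question's input format).
  val DocSep = "~~~"

  private def isHeader(line: String): Boolean = line.startsWith(DocSep)

  /** Lazily turns raw lines into one "title: body" string per document.
    * Hypothetical helper: only one document is held in memory at a time.
    */
  def mergeDocuments(lines: Iterator[String]): Iterator[String] = {
    val in = lines.buffered

    // Assumption: lines before the first header belong to no document and are skipped.
    while (in.hasNext && !isHeader(in.head)) in.next()

    new Iterator[String] {
      // Invariant: if `in` has a next element, that element is a header line.
      def hasNext: Boolean = in.hasNext

      def next(): String = {
        // Keep the title instead of discarding the separator line.
        val title = in.next().stripPrefix(DocSep).trim
        val body = Vector.newBuilder[String]
        // Collect body lines until the next header (or the end of the input).
        while (in.hasNext && !isHeader(in.head)) body += in.next()
        s"$title: ${body.result().mkString(" ")}"
      }
    }
  }
}
```

To try it on the file from the question, pass it scala.io.Source.fromFile("myInput.txt").getLines() and write each resulting string on its own line. Because the header line is consumed together with its body, the title never has to be re-attached after a split.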