Tags: java, scala, functional-programming, nio

Splitting a large log file in to multiple files in Scala


I have a large log file with client-id as one of the fields in each log line. I would like to split this large log file into several files, grouped by client-id. So, if the original file has 10 lines with 10 unique client-ids, then at the end there will be 10 files with 1 line in each.

I am trying to do this in Scala, and since I don't want to load the entire file into memory, I read one line at a time using scala.io.Source.getLines(). That is working nicely. But I don't have a good way to write the lines out to separate files one at a time. I can think of two options:

  1. Create a new PrintWriter backed by a BufferedWriter (Files.newBufferedWriter) for every line. This seems inefficient.

  2. Create a new PrintWriter backed by a BufferedWriter for every output file, hold on to these PrintWriters, keep writing to them until all lines in the original log file have been read, and then close them. This doesn't seem like a very functional way to do things in Scala.

Being new to Scala, I am not sure if there is a better way to accomplish something like this. Any thoughts or ideas are much appreciated.
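For reference, the first option above can also be approximated without holding any writers open at all, by appending each line to its target file via java.nio. This is a hedged sketch (the function name `splitByAppend` and the `log_$id` output naming are illustrative, not from the question); it avoids the resource-tracking problem entirely, but pays a file-open cost for every single line, which is exactly the inefficiency the question worries about:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths, StandardOpenOption}
import scala.io.Source

// Sketch of option 1: open, append, and close the target file for each line.
// No writers are kept open, but every line pays a file-open cost.
def splitByAppend(logPath: String, outDir: String): Unit =
  for (line <- Source.fromFile(logPath).getLines()) {
    val id = line.split(" ").head // assumes client-id is the first field
    Files.write(
      Paths.get(outDir, s"log_$id"),
      (line + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
      StandardOpenOption.CREATE, StandardOpenOption.APPEND)
  }
```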


Solution

  • You can do the second option in pretty functional, idiomatic Scala: keep track of all of your PrintWriters and fold over the lines of the file:

    import java.io._
    import scala.io._
    
    // Fold over the lines, threading a Map of client-id -> PrintWriter through
    // the fold; a writer is created lazily the first time its id is seen.
    Source.fromFile(new File("/tmp/log")).getLines().foldLeft(Map.empty[String, PrintWriter]) {
        case (printers, line) =>
            val id = line.split(" ").head
            val printer = printers.getOrElse(id, new PrintWriter(new File(s"/tmp/log_$id")))
            printer.println(line)
            printers.updated(id, printer)
    }.values.foreach(_.close()) // close every writer once all lines are processed
    

    Maybe in a production-level version, you'd want to wrap the I/O operations in a try (or Try) and keep track of failures that way, while still closing all the PrintWriters at the end.
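    That production-level idea could be sketched like this (the function name `splitByClientId` and the failure-collection strategy are illustrative assumptions, not part of the original answer): each write is wrapped in scala.util.Try so one bad line doesn't abort the run, and a finally block guarantees every writer and the source get closed:

    ```scala
    import java.io.{File, PrintWriter}
    import scala.io.Source
    import scala.util.{Failure, Try}

    // Illustrative sketch: split a log file by its first whitespace-delimited
    // field, collecting per-line failures and always closing the writers.
    def splitByClientId(logPath: String, outDir: String): Seq[Throwable] = {
      val source = Source.fromFile(logPath)
      var writers = Map.empty[String, PrintWriter]
      val failures = scala.collection.mutable.Buffer.empty[Throwable]
      try {
        for (line <- source.getLines()) {
          Try {
            val id = line.split(" ").head
            val writer = writers.getOrElse(id, {
              val w = new PrintWriter(new File(s"$outDir/log_$id"))
              writers = writers.updated(id, w)
              w
            })
            writer.println(line)
          } match {
            case Failure(e) => failures += e // record the failure, keep going
            case _          =>
          }
        }
      } finally {
        writers.values.foreach(w => Try(w.close())) // close even on error
        source.close()
      }
      failures.toSeq
    }
    ```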