Search code examples
scalacsvline-breaks

Reading CSV file with multi line strings in Scala


I have a csv file and I want to read it line by line. The problem is that some cell values are in quotes containing line breaks.

Here is an example CSV:

Product,Description,Price
Product A,This is Product A,20
Product B,"This is much better
than Product A",200

The standard getLines() function does not handle that.

Source.fromFile(inputFile).getLines()  // will split at every line break, regardless if quoted or not

The getLines gets something like:

Array("Product", "Description", "Price")
Array("Product A", "this is Product A", "20")
Array("Product A", "\"This is much better")
Array("than Product A\"", "20")

But it should be like:

Array("Product", "Description", "Price")
Array("Product A", "this is Product A", "20")
Array("Product A", "\"This is much better\nthan Product A\"", "20")

I tried it to read the file fully and split is with a RegEx similar to this post https://stackoverflow.com/a/31193505

file.mkString.split("""\n(?=(?:[^"]*"[^"]*")*[^"]*$)""")

The regex works fine, but I'm getting a stack overflow exception because the file is too big to handle it fully out of memory. I tried it with a smaller version of the file and it worked.

As stated in the post, a foldLeft() could help for bigger files. But I'm not sure how it should work, when iterating over every Char of the string, to pass all at once...

  1. the Char of your current iteration
  2. the Line what you are building
  3. and the List of already created lines

Maybe it works to write an own tail recursive version of the getLines, but I'm not sure if there isn't a more practical solution instead of handling it char by char.

Do you see any other functional-style solution to this problem?

Tanks and regards, Felix


Solution

  • The most simple answer is finding an external lib to do it !

    If it's not a solution for you, the foldLeft solution is imo the best functional style ! Here's a simple version :

      val lines = Source.fromFile(inputFile).getLines()
    
      lines.foldLeft[(Seq[String], String)](Nil, "") {
        case ((accumulatedLines, accumulatedString), newLine) => {
          val isInAnOpenString = accumulatedString.nonEmpty
          val lineHasOddQuotes =  newLine.count(_ == '"') % 2 == 1
          (isInAnOpenString, lineHasOddQuotes) match {
            case (true, true) => (accumulatedLines :+ (accumulatedString + newLine)) -> ""
            case (true, false) => accumulatedLines -> (accumulatedString + newLine)
            case (false, true) => accumulatedLines -> newLine
            case (false, false) => (accumulatedLines :+ newLine) -> ""
          }
        }
      }._1
    
    

    Note that this version won't handle too much special cases, like having multiple values on a line that contain multiple lines, but it should give you a good start idea.

    The main idea is to foldLeft on pretty much everything you need to keep in memory, and from that change your state progressively.

    As you can see, inside the foldLeft you can have as much logic as needed. In this case, I added extra booleans and a nest match case for readibility for example.

    So my advice would be : foldLeft, and don't panic !