I have a csv file and I want to read it line by line. The problem is that some cell values are in quotes containing line breaks.
Here is an example CSV:
Product,Description,Price
Product A,This is Product A,20
Product B,"This is much better
than Product A",200
The standard getLines() function does not handle that.
Source.fromFile(inputFile).getLines() // will split at every line break, regardless if quoted or not
The getLines gets something like:
Array("Product", "Description", "Price")
Array("Product A", "this is Product A", "20")
Array("Product A", "\"This is much better")
Array("than Product A\"", "20")
But it should be like:
Array("Product", "Description", "Price")
Array("Product A", "this is Product A", "20")
Array("Product A", "\"This is much better\nthan Product A\"", "20")
I tried it to read the file fully and split is with a RegEx similar to this post https://stackoverflow.com/a/31193505
file.mkString.split("""\n(?=(?:[^"]*"[^"]*")*[^"]*$)""")
The regex works fine, but I'm getting a stack overflow exception because the file is too big to handle it fully out of memory. I tried it with a smaller version of the file and it worked.
As stated in the post, a foldLeft() could help for bigger files. But I'm not sure how it should work, when iterating over every Char of the string, to pass all at once...
Maybe it works to write an own tail recursive version of the getLines, but I'm not sure if there isn't a more practical solution instead of handling it char by char.
Do you see any other functional-style solution to this problem?
Tanks and regards, Felix
The most simple answer is finding an external lib to do it !
If it's not a solution for you, the foldLeft solution is imo the best functional style ! Here's a simple version :
val lines = Source.fromFile(inputFile).getLines()
lines.foldLeft[(Seq[String], String)](Nil, "") {
case ((accumulatedLines, accumulatedString), newLine) => {
val isInAnOpenString = accumulatedString.nonEmpty
val lineHasOddQuotes = newLine.count(_ == '"') % 2 == 1
(isInAnOpenString, lineHasOddQuotes) match {
case (true, true) => (accumulatedLines :+ (accumulatedString + newLine)) -> ""
case (true, false) => accumulatedLines -> (accumulatedString + newLine)
case (false, true) => accumulatedLines -> newLine
case (false, false) => (accumulatedLines :+ newLine) -> ""
}
}
}._1
Note that this version won't handle too much special cases, like having multiple values on a line that contain multiple lines, but it should give you a good start idea.
The main idea is to foldLeft on pretty much everything you need to keep in memory, and from that change your state progressively.
As you can see, inside the foldLeft you can have as much logic as needed. In this case, I added extra booleans and a nest match case for readibility for example.
So my advice would be : foldLeft, and don't panic !