I want to read a generic CSV file with headers but unknown column number into a typed structure. My question is kind of the same as Strongly typed access to csv in scala? but with the fact I would have no schema to pass to the parser...
Until now, I was using Jackson CSV mapper to read each row as a Map[String,String], and it was working well.
import com.fasterxml.jackson.module.scala.DefaultScalaModule
def genericStringIterator(input: InputStream): Iterator[Map[String, String]] = {
val mapper = new CsvMapper()
mapper.registerModule(DefaultScalaModule)
val schema = CsvSchema.emptySchema.withHeader
val iterator = mapper
.readerFor(classOf[Map[String, String]])
.`with`(schema)
.readValues[Map[String, String]](input)
iterator.asScala
}
Now, we need the field to be typed, so 4.2 would be a Double but "4.2" would still be a String.
We are using play-json everywhere in our project, and so I know JsValue already has a great type inference for generic stuff like that.
As paly-json it is based on Jackson too, I thought it would be great to have something like
import play.api.libs.json.jackson.PlayJsonModule
def genericStringIterator(input: InputStream): Iterator[JsValue] = {
val mapper = new CsvMapper()
mapper.registerModule(PlayJsonModule)
val schema = CsvSchema.emptySchema.withHeader
val iterator = mapper
.readerFor(classOf[JsValue])
.`with`(schema)
.readValues[JsValue](input)
iterator.asScala
}
But when I try the former code, I get an exception :
val iterator = CSV.genericAnyIterator(input(
"""foo,bar,baz
|"toto",42,43
|"tata",,45
| titi,87,88
|"tutu",,
|""".stripMargin))
iterator
.foreach { a =>
println(a)
}
java.lang.RuntimeException: We should be reading map, something got wrong
at play.api.libs.json.jackson.JsValueDeserializer.deserialize(JacksonJson.scala:165)
at play.api.libs.json.jackson.JsValueDeserializer.deserialize(JacksonJson.scala:128)
at play.api.libs.json.jackson.JsValueDeserializer.deserialize(JacksonJson.scala:123)
at com.fasterxml.jackson.databind.MappingIterator.nextValue(MappingIterator.java:277)
at com.fasterxml.jackson.databind.MappingIterator.next(MappingIterator.java:192)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:40)
at scala.collection.Iterator.foreach(Iterator.scala:929)
at scala.collection.Iterator.foreach$(Iterator.scala:929)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1417)
at my.company.csv.CSVSpec$$anon$4.<init>(CSVSpec.scala:240)
Is there something I'm doing wrong ?
I don't care particulary to have a play-json JsValue in the end, any Json structure with generic typed field would be ok. Is there another lib I could use for that ? For what I found, all other libs are based on a mapping given to the CSV Reader in advance, and what is important for me is to be able to infer the type from the CSV.
Ok, I was lazy to want to find something working out of the box :) In fact it was easy to implement it myself.
I went looking to lib in other languages that does this infering (PapaParse in JS, Pandas in python), and discovered that they were doing a test-and-retry with priority to guess the types.
So I implemented it myself, and it works fine !
Here it is:
def genericAnyIterator(input: InputStream): Iterator[JsValue] = {
// the result of former code, mapping to Map[String,String]
val strings = genericStringIterator(input)
val decimalRegExp: Regex = "(\\d*){1}(\\.\\d*){0,1}".r
val jsValues: Iterator[Map[String, JsValue]] = strings.map { m =>
m.mapValues {
case "" => None
case "false" | "FALSE" => Some(JsFalse)
case "true" | "TRUE" => Some(JsTrue)
case value@decimalRegExp(g1, g2) if !value.isEmpty => Some(JsNumber(value.toDouble))
case "null" | "NULL" => Some(JsNull)
case value@_ => Some(JsString(value))
}
.filter(_._2.isDefined)
.mapValues(_.get)
}
jsValues.map(map => JsObject(map.toSeq))
}
which does in a test
it should "read any csv in JsObject" in new WithInputStream {
val iterator = CSV.genericAnyIterator(input(
"""foo,bar,baz
|"toto",NaN,false
|"tata",,TRUE
|titi,87.79,88
|"tutu",,null
|"tete",5.,.5
|""".stripMargin))
val result: Seq[JsValue] = iterator.toSeq
result should be(Stream(
Json.obj("foo" -> "toto", "bar" -> "NaN", "baz" -> false)
, Json.obj("foo" -> "tata", "baz" -> true)
, Json.obj("foo" -> "titi", "bar" -> 87.79, "baz" -> 88)
, Json.obj("foo" -> "tutu", "baz" -> JsNull)
, Json.obj("foo" -> "tete", "bar" -> 5, "baz" -> 0.5)
) )
}