Search code examples
jsonscalacsvjacksonplay-json

discover type from CSV columns in scala


I want to read a generic CSV file with headers but unknown column number into a typed structure. My question is kind of the same as Strongly typed access to csv in scala? but with the fact I would have no schema to pass to the parser...

Until now, I was using Jackson CSV mapper to read each row as a Map[String,String], and it was working well.

import com.fasterxml.jackson.module.scala.DefaultScalaModule

def genericStringIterator(input: InputStream): Iterator[Map[String, String]] = {

    val mapper = new CsvMapper()

    mapper.registerModule(DefaultScalaModule)

    val schema = CsvSchema.emptySchema.withHeader

    val iterator = mapper
      .readerFor(classOf[Map[String, String]])
      .`with`(schema)
      .readValues[Map[String, String]](input)

    iterator.asScala
  }

Now, we need the field to be typed, so 4.2 would be a Double but "4.2" would still be a String.

We are using play-json everywhere in our project, and so I know JsValue already has a great type inference for generic stuff like that.

As paly-json it is based on Jackson too, I thought it would be great to have something like

import play.api.libs.json.jackson.PlayJsonModule


def genericStringIterator(input: InputStream): Iterator[JsValue] = {

    val mapper = new CsvMapper()

    mapper.registerModule(PlayJsonModule)

    val schema = CsvSchema.emptySchema.withHeader

    val iterator = mapper
      .readerFor(classOf[JsValue])
      .`with`(schema)
      .readValues[JsValue](input)

    iterator.asScala
  }

But when I try the former code, I get an exception :

   val iterator = CSV.genericAnyIterator(input(
      """foo,bar,baz
        |"toto",42,43
        |"tata",,45
        | titi,87,88
        |"tutu",,
        |""".stripMargin))

    iterator
      .foreach { a =>
        println(a)
      }
java.lang.RuntimeException: We should be reading map, something got wrong
    at play.api.libs.json.jackson.JsValueDeserializer.deserialize(JacksonJson.scala:165)
    at play.api.libs.json.jackson.JsValueDeserializer.deserialize(JacksonJson.scala:128)
    at play.api.libs.json.jackson.JsValueDeserializer.deserialize(JacksonJson.scala:123)
    at com.fasterxml.jackson.databind.MappingIterator.nextValue(MappingIterator.java:277)
    at com.fasterxml.jackson.databind.MappingIterator.next(MappingIterator.java:192)
    at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:40)
    at scala.collection.Iterator.foreach(Iterator.scala:929)
    at scala.collection.Iterator.foreach$(Iterator.scala:929)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1417)
    at my.company.csv.CSVSpec$$anon$4.<init>(CSVSpec.scala:240)

Is there something I'm doing wrong ?

I don't care particulary to have a play-json JsValue in the end, any Json structure with generic typed field would be ok. Is there another lib I could use for that ? For what I found, all other libs are based on a mapping given to the CSV Reader in advance, and what is important for me is to be able to infer the type from the CSV.


Solution

  • Ok, I was lazy to want to find something working out of the box :) In fact it was easy to implement it myself.

    I went looking to lib in other languages that does this infering (PapaParse in JS, Pandas in python), and discovered that they were doing a test-and-retry with priority to guess the types.

    So I implemented it myself, and it works fine !

    Here it is:

    def genericAnyIterator(input: InputStream): Iterator[JsValue] = {
    
        // the result of former code, mapping to Map[String,String]
        val strings = genericStringIterator(input)
    
        val decimalRegExp: Regex = "(\\d*){1}(\\.\\d*){0,1}".r
    
        val jsValues: Iterator[Map[String, JsValue]] = strings.map { m =>
          m.mapValues {
            case "" => None
            case "false" | "FALSE" => Some(JsFalse)
            case "true" | "TRUE" => Some(JsTrue)
            case value@decimalRegExp(g1, g2) if !value.isEmpty => Some(JsNumber(value.toDouble))
            case "null" | "NULL" => Some(JsNull)
            case value@_ => Some(JsString(value))
          }
            .filter(_._2.isDefined)
            .mapValues(_.get)
        }
        jsValues.map(map => JsObject(map.toSeq))
      }
    
    

    which does in a test

     it should "read any csv in JsObject" in new WithInputStream {
    
        val iterator = CSV.genericAnyIterator(input(
          """foo,bar,baz
            |"toto",NaN,false
            |"tata",,TRUE
            |titi,87.79,88
            |"tutu",,null
            |"tete",5.,.5
            |""".stripMargin))
    
        val result: Seq[JsValue] = iterator.toSeq
    
        result should be(Stream(
          Json.obj("foo" -> "toto", "bar" -> "NaN", "baz" -> false)
          , Json.obj("foo" -> "tata",  "baz" -> true)
          , Json.obj("foo" -> "titi", "bar" -> 87.79, "baz" -> 88)
          , Json.obj("foo" -> "tutu",  "baz" -> JsNull)
          , Json.obj("foo" -> "tete", "bar" -> 5, "baz" -> 0.5)
        ) )
      }