
Transform JSON tree to other format (XML, CSV etc.) recursively with circe


In order to transform JSON nodes to a format other than JSON (such as XML or CSV) with circe, I came up with a solution that requires access to circe's internal data structures.

This is my working sample that transforms JSON to an XML string (not perfect, but you get the idea):

package io.circe

import io.circe.Json.{JArray, JBoolean, JNull, JNumber, JObject, JString}
import io.circe.parser.parse

object Sample extends App {

  def transformToXMLString(js: Json): String = js match {
    case JNull => ""
    case JBoolean(b) => b.toString
    case JNumber(n) => n.toString
    case JString(s) => s.toString
    case JArray(a) => a.map(transformToXMLString(_)).mkString("")
    case JObject(o) => o.toMap.map {
      case (k, v) => s"<${k}>${transformToXMLString(v)}</${k}>"
    }.mkString("")
  }

  val json =
    """{
      | "root": {
      |  "sampleboolean": true,
      |  "sampleobj": {
      |    "anInt": 1,
      |    "aString": "string"
      |  },
      |  "objarray": [
      |     {"v1": 1},
      |     {"v2": 2}
      |  ]
      | }
      |}""".stripMargin

  val res = transformToXMLString(parse(json).right.get)
  println(res)
}

Results in:

<root><sampleboolean>true</sampleboolean><sampleobj><anInt>1</anInt><aString>string</aString></sampleobj><objarray><v1>1</v1><v2>2</v2></objarray></root>

That would be fine if the low-level JSON constructors (JBoolean, JString, JObject, etc.) were not package-private in circe. As it stands, the code above only works when it is placed in package io.circe.

How can I achieve the same result using only the public circe API?


Solution

  • The fold method on Json allows you to perform this kind of operation quite concisely (and in a way that enforces exhaustivity, just like pattern matching on a sealed trait):

    import io.circe.Json
    
    def transformToXMLString(js: Json): String = js.fold(
      "",
      _.toString,
      _.toString,
      identity,
      _.map(transformToXMLString(_)).mkString(""),
      _.toMap.map {
        case (k, v) => s"<${k}>${transformToXMLString(v)}</${k}>"
      }.mkString("")
    )
    

    And then:

    scala> import io.circe.parser.parse
    import io.circe.parser.parse
    
    scala> transformToXMLString(parse(json).right.get)
    res1: String = <root><sampleboolean>true</sampleboolean><sampleobj><anInt>1</anInt><aString>string</aString></sampleobj><objarray><v1>1</v1><v2>2</v2></objarray></root>
    

    Exactly the same result as your implementation, but with a few fewer characters and without relying on private implementation details.

    So the answer is "use fold" (or the asX methods, as suggested in the other answer; that approach is more flexible but is generally less idiomatic and more verbose). If you care about why we've made the design decision in circe not to expose the constructors, you can skip to the end of this answer, but this kind of question comes up a lot, so I also want to address a few related points first.
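
    For comparison, here is a rough sketch of what the asX approach could look like. It assumes only the public accessors on Json (asBoolean, asNumber, asString, asArray, asObject); the name transformWithAsX is made up for this illustration.

```scala
import io.circe.Json

// Sketch only: chains the public asX accessors instead of calling fold.
// Unlike fold, the compiler cannot check that every JSON case is handled.
def transformWithAsX(js: Json): String =
  js.asBoolean.map(_.toString)
    .orElse(js.asNumber.map(_.toString))
    .orElse(js.asString)
    .orElse(js.asArray.map(_.map(transformWithAsX).mkString("")))
    .orElse(js.asObject.map(_.toMap.map {
      case (k, v) => s"<$k>${transformWithAsX(v)}</$k>"
    }.mkString("")))
    .getOrElse("") // only JSON null falls through to this branch
```

    Forgetting one of the orElse branches still compiles, which is the exhaustivity trade-off mentioned above.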

    A side note about naming

    Note that the use of the name "fold" for this method is inherited from Argonaut, and is arguably inaccurate. When we talk about catamorphisms (or folds) for recursive algebraic data types, we mean a function where we don't see the ADT type in the arguments of the functions we're passing in. For example, the signature of the fold for lists looks like this:

    def foldLeft[B](z: B)(op: (B, A) => B): B
    

    Not this:

    def foldLeft[B](z: B)(op: (List[A], A) => B): B
    

    Since io.circe.Json is a recursive ADT, its fold method really should look like this:

    def properFold[X](
      jsonNull: => X,
      jsonBoolean: Boolean => X,
      jsonNumber: JsonNumber => X,
      jsonString: String => X,
      jsonArray: Vector[X] => X,
      jsonObject: Map[String, X] => X
    ): X
    

    Instead of:

    def fold[X](
      jsonNull: => X,
      jsonBoolean: Boolean => X,
      jsonNumber: JsonNumber => X,
      jsonString: String => X,
      jsonArray: Vector[Json] => X,
      jsonObject: JsonObject => X
    ): X
    

    But in practice the former seems less useful, so circe only provides the latter (if you want to recurse, you have to do it manually), and follows Argonaut in calling it fold. This has always made me a little uncomfortable, and the name may change in the future.
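
    If you do want the catamorphism-style version, it can be written on top of the existing fold. The following is a minimal sketch, not something circe provides: properFold is taken from the signature above, while the helper names go and toXml are invented here.

```scala
import io.circe.{ Json, JsonNumber, JsonObject }

// Sketch: a "proper" fold built on the public fold method.
// The recursion happens here, so the handlers never see a Json value.
def properFold[X](js: Json)(
  jsonNull: => X,
  jsonBoolean: Boolean => X,
  jsonNumber: JsonNumber => X,
  jsonString: String => X,
  jsonArray: Vector[X] => X,
  jsonObject: Map[String, X] => X
): X = {
  def go(j: Json): X = j.fold(
    jsonNull,
    jsonBoolean,
    jsonNumber,
    jsonString,
    arr => jsonArray(arr.map(go)),
    // toMap is used as in the examples above; it may not keep key order for larger objects
    (obj: JsonObject) => jsonObject(obj.toMap.map { case (k, v) => k -> go(v) })
  )
  go(js)
}

// The XML transformation again: children arrive already rendered as strings
def toXml(js: Json): String = properFold[String](js)(
  "",
  _.toString,
  _.toString,
  identity,
  _.mkString(""),
  _.map { case (k, v) => s"<$k>$v</$k>" }.mkString("")
)
```

    Note that the array and object handlers in toXml contain no recursive call, which is exactly what distinguishes this shape from the fold that circe ships.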

    A side note about performance

    In some cases instantiating the six functions fold expects may be prohibitively expensive, so circe also allows you to bundle the operations together:

    import io.circe.{ Json, JsonNumber, JsonObject }
    
    val xmlTransformer: Json.Folder[String] = new Json.Folder[String] {
      def onNull: String = ""
      def onBoolean(value: Boolean): String = value.toString
      def onNumber(value: JsonNumber): String = value.toString
      def onString(value: String): String = value
      def onArray(value: Vector[Json]): String =
        value.map(_.foldWith(this)).mkString("")
      def onObject(value: JsonObject): String = value.toMap.map {
        case (k, v) => s"<${k}>${v.foldWith(this)}</${k}>"
      }.mkString("")
    }
    

    And then:

    scala> parse(json).right.get.foldWith(xmlTransformer)
    res2: String = <root><sampleboolean>true</sampleboolean><sampleobj><anInt>1</anInt><aString>string</aString></sampleobj><objarray><v1>1</v1><v2>2</v2></objarray></root>
    

    The performance benefit from using Folder will vary depending on whether you're on Scala 2.11 or 2.12, but if the actual operations you're performing on the JSON values are cheap, you can expect the Folder version to get about twice the throughput of fold. Incidentally, it's also significantly faster than pattern matching on the internal constructors, at least in the benchmarks we've done:

    Benchmark                           Mode  Cnt      Score    Error  Units
    FoldingBenchmark.withFold          thrpt   10   6769.843 ± 79.005  ops/s
    FoldingBenchmark.withFoldWith      thrpt   10  13316.918 ± 60.285  ops/s
    FoldingBenchmark.withPatternMatch  thrpt   10   8022.192 ± 63.294  ops/s
    

    That's on 2.12. I believe you should see even more of a difference on 2.11.

    A side note about optics

    If you really want pattern matching, circe-optics gives you a high-powered alternative to case class extractors:

    import io.circe.Json, io.circe.optics.all._
    
    def transformToXMLString(js: Json): String = js match {
      case `jsonNull` => ""
      case jsonBoolean(b) => b.toString
      case jsonNumber(n) => n.toString
      case jsonString(s) => s.toString
      case jsonArray(a) => a.map(transformToXMLString(_)).mkString("")
      case jsonObject(o) => o.toMap.map {
        case (k, v) => s"<${k}>${transformToXMLString(v)}</${k}>"
      }.mkString("")
    }
    

    This is almost exactly the same code as your original version, but each of these extractors is a Monocle prism that can be composed with other optics from the Monocle library.

    (The downside of this approach is that you lose exhaustivity checking, but unfortunately that can't be helped.)
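
    As a small illustration of that composability, circe-optics also offers JsonPath, which chains such optics into a path through a document. This sketch assumes the circe-optics module is on the classpath; the document doc and the name anInt are invented for the example.

```scala
import io.circe.Json
import io.circe.optics.JsonPath.root

// A tiny document shaped like part of the sample from the question
val doc: Json = Json.obj(
  "sampleobj" -> Json.obj(
    "anInt" -> Json.fromInt(1),
    "aString" -> Json.fromString("string")
  )
)

// An optic focusing on sampleobj.anInt, viewed as an Int
val anInt = root.sampleobj.anInt.int

// Reading yields an Option: None if the path is missing or the value is not an Int
val found: Option[Int] = anInt.getOption(doc)

// Modifying returns a new document; the original is left untouched
val bumped: Json = anInt.modify(_ + 1)(doc)
```

    The same optic serves for both reading and updating, which plain extractors cannot do.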

    Why not just case classes

    When I first started working on circe I wrote the following in a document about some of my design decisions:

    In some cases, including most significantly here the io.circe.Json type, we don't want to encourage users to think of the ADT leaves as having meaningful types. A JSON value "is" a boolean or a string or a unit or a Seq[Json] or a JsonNumber or a JsonObject. Introducing types like JString, JNumber, etc. into the public API just confuses things.

    I wanted a really minimal API (and especially an API that avoided exposing types that weren't meaningful) and I wanted room to optimize the JSON representation. (I also just didn't really want people to be working with the JSON AST at all, but that's been more of a losing battle.) I still think hiding the constructors was the right decision, even though I haven't really taken advantage of their absence in optimizations (yet), and even though this question comes up a lot.