Tags: json, scala, stream, iterator, jerkson

Scanning a HUGE JSON file for deserializable data in Scala


I need to process large JSON files, instantiating objects from deserializable substrings as the file is streamed in or iterated over.

For example:

Let's say I can only deserialize into instances of the following:

case class Data(val a: Int, val b: Int, val c: Int)

and the expected JSON format is:

{   "foo": [ {"a": 0, "b": 0, "c": 0 }, {"a": 0, "b": 0, "c": 1 } ], 
    "bar": [ {"a": 1, "b": 0, "c": 0 }, {"a": 1, "b": 0, "c": 1 } ], 
     .... MANY ITEMS .... , 
    "qux": [ {"a": 0, "b": 0, "c": 0 } ] }

What I would like to do is:

import com.codahale.jerkson.Json
val dataSeq : Seq[Data] = Json.advanceToValue("foo").stream[Data](fileStream)
// NOTE: this will not compile since I pulled the "advanceToValue" out of thin air.

As a final note, I would prefer a solution that uses Jerkson or another library that comes with the Play framework. If another Scala library handles this scenario with greater ease and decent performance, I'm not opposed to trying it. If there is a clean way of manually seeking through the file and then using a JSON library to continue parsing from there, I'm fine with that too.

What I do not want to do is ingest the entire file without streaming or using an iterator, as keeping the whole file in memory at once would be prohibitively expensive.


Solution

  • Here is the current way I am solving the problem:

    import collection.immutable.PagedSeq
    import util.parsing.input.PagedSeqReader
    import com.codahale.jerkson.Json
    import collection.mutable
    
    private def fileContent = new PagedSeqReader(PagedSeq.fromFile("/home/me/data.json"))
    private val clearAndStop = ']'
    
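    // Consume characters until the accumulated text ends with `text` (or input ends).
    // Returns the advanced reader plus the consumed text, or None if `text` was
    // never reached or a ']' (end of the array) was crossed on the way.
    private def takeUntil(readerInitial: PagedSeqReader, text: String) : Taken = {
      val str = new StringBuilder()
      var readerFinal = readerInitial
    
      while(!readerFinal.atEnd && !str.endsWith(text)) {
        str += readerFinal.first
        readerFinal = readerFinal.rest
      }
    
      if (!str.endsWith(text) || str.contains(clearAndStop))
        Taken(readerFinal, None)
      else
        Taken(readerFinal, Some(str.toString))
    }
    
    // Consume up to each of the given characters in turn; the result reflects the last one.
    private def takeUntil(readerInitial: PagedSeqReader, chars: Char*) : Taken = {
      var taken = Taken(readerInitial, None)
      chars.foreach(ch => taken = takeUntil(taken.reader, ch.toString))
    
      taken
    }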
    
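    def getJsonData() : Seq[Data] = {
      val data = mutable.ListBuffer[Data]()
      // Seek to the "foo" key, then past the ':' and the opening '[' of its array.
      var taken = takeUntil(fileContent, "\"foo\"")
      taken = takeUntil(taken.reader, ':', '[')
    
      var doneFirst = false
      while(taken.text.isDefined) {
        // Every object after the first is preceded by a ',' separator.
        if (!doneFirst)
          doneFirst = true
        else
          taken = takeUntil(taken.reader, ',')
    
        // Read one object up to its closing '}' and deserialize it.
        taken = takeUntil(taken.reader, '}')
        if (taken.text.isDefined) {
          print(taken.text.get)
          data += Json.parse[Data](taken.text.get)
        }
      }
    
      data.toList
    }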
    
    case class Taken(reader: PagedSeqReader, text: Option[String])
    case class Data(val a: Int, val b: Int, val c: Int)
    

    Granted, this code does not handle malformed JSON cleanly, and supporting several top-level keys ("foo", "bar" and "qux") would require looking ahead or matching against a list of possible keys. In general, though, I believe it does the job. It is not as functional as I would like and is not very robust, but PagedSeqReader keeps it from getting too messy.
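
The same seek-then-extract idea could also be sketched as a lazy Iterator over a plain java.io.Reader, which avoids the mutable ListBuffer and stops reading at the array's closing bracket. This is only a sketch under assumptions: the names StreamSketch, objectsUnder, dataUnder and parseData are made up for illustration, the objects in the array are assumed to be flat (no nested braces, and no braces or brackets inside string values), the quoted key is assumed not to occur earlier in the file as a string value, and parseData is a regex stand-in for Json.parse[Data] so that the block runs without Jerkson.

```scala
import java.io.Reader

object StreamSketch {
  case class Data(a: Int, b: Int, c: Int)

  // Stand-in for Json.parse[Data] (assumption: flat objects with keys a, b, c in that order).
  val DataShape = """\{\s*"a":\s*(-?\d+),\s*"b":\s*(-?\d+),\s*"c":\s*(-?\d+)\s*\}""".r

  def parseData(text: String): Option[Data] = text match {
    case DataShape(a, b, c) => Some(Data(a.toInt, b.toInt, c.toInt))
    case _                  => None
  }

  /** Lazily yields the raw text of each `{...}` object in the array stored under `key`. */
  def objectsUnder(reader: Reader, key: String): Iterator[String] = {
    // Consume characters until `target` has just been read; false if input ends first.
    def skipPast(target: String): Boolean = {
      val window = new java.lang.StringBuilder
      var c = reader.read()
      while (c != -1) {
        window.append(c.toChar)
        if (window.length() > target.length) window.deleteCharAt(0)
        if (window.toString == target) return true
        c = reader.read()
      }
      false
    }

    // Seek to the quoted key, then to the '[' that opens its array.
    val inArray = skipPast("\"" + key + "\"") && skipPast("[")

    new Iterator[String] {
      private var buffered: Option[String] = None
      private var finished = !inArray

      // Read up to the next complete object, or mark the array finished at ']' or end of input.
      private def fill(): Unit = {
        val sb = new java.lang.StringBuilder
        var inObject = false
        while (buffered.isEmpty && !finished) {
          val c = reader.read()
          if (c == -1) finished = true
          else {
            val ch = c.toChar
            if (inObject) {
              sb.append(ch)
              if (ch == '}') buffered = Some(sb.toString)
            }
            else if (ch == '{') { inObject = true; sb.append(ch) }
            else if (ch == ']') finished = true
            // commas and whitespace between objects are skipped
          }
        }
      }

      def hasNext: Boolean = {
        if (buffered.isEmpty && !finished) fill()
        buffered.isDefined
      }

      def next(): String = {
        if (!hasNext) throw new NoSuchElementException("no more objects")
        val out = buffered.get
        buffered = None
        out
      }
    }
  }

  /** Objects under `key` that parse as Data; unparseable ones are dropped. */
  def dataUnder(reader: Reader, key: String): Iterator[Data] =
    objectsUnder(reader, key).flatMap(s => parseData(s).toList)
}
```

With a java.io.BufferedReader over the file, StreamSketch.dataUnder(reader, "foo") would hold only one object's text in memory at a time, and each element is read only when the iterator is advanced. Swapping parseData for a real JSON library call is the intended next step; the regex is there purely to keep the sketch self-contained.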