json stream smalltalk pharo ndjson

How to parse ndjson in Pharo with NeoJSON


I want to parse ndjson (newline delimited json) data with NeoJSON on Pharo Smalltalk.

ndjson data looks like this:

{"smalltalk": "cool"}
{"pharo": "cooler"}

At the moment I convert my file stream to a string, split it on newlines, and then parse each part with NeoJSON. This seems to use an unnecessarily large amount of memory and time, probably because streams are converted to strings and back all the time. What would be an efficient way to do this task?

If you are looking for sample data: NYPL-publicdomain: pd_items_1.ndjson


Solution

  • This is the answer from Sven (the author of NeoJSON) on the pharo-users mailing list (he is not on SO):

    Reading the 'format' is easy, just keep on doing #next for each JSON expression (whitespace is ignored).

    | data reader |
    data := '{"smalltalk": "cool"}
    {"pharo": "cooler"}'.
    reader := NeoJSONReader on: data readStream.
    Array streamContents: [ :out |
      [ reader atEnd ] whileFalse: [ out nextPut: reader next ] ].
    

    Preventing intermediary data structures is easy too, use streaming.

    | client reader data networkStream |
    (client := ZnClient new)
      streaming: true;
      url: 'https://github.com/NYPL-publicdomain/data-and-utilities/blob/master/items/pd_items_1.ndjson?raw=true';
      get.
    networkStream := ZnCharacterReadStream on: client contents.
    reader := NeoJSONReader on: networkStream.
    data := Array streamContents: [ :out |
      [ reader atEnd ] whileFalse: [ out nextPut: reader next ] ].
    client close.
    data.
    

    It took a couple of seconds; it is 80MB+ over the network for 50K items, after all.
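
  • Addendum (not part of Sven's reply): since the question starts from a file rather than a network stream, the same #next loop can probably be pointed at a file stream directly. The following is a minimal sketch, assuming a recent Pharo where #readStreamDo: on a file reference yields a UTF-8 decoding character stream, and assuming the sample file has been saved locally as 'pd_items_1.ndjson' in the image's working directory (that path is hypothetical). Counting the records stands in for whatever per-item processing is needed.

    ```smalltalk
    | count |
    count := 0.
    "Hypothetical local path; adjust to wherever the ndjson file lives."
    'pd_items_1.ndjson' asFileReference readStreamDo: [ :fileStream |
      | reader |
      reader := NeoJSONReader on: fileStream.
      "Parse one JSON value per iteration, straight from the file stream."
      [ reader atEnd ] whileFalse: [
        | item |
        item := reader next.
        "Placeholder work on one record; replace with real processing."
        count := count + 1 ] ].
    count.
    ```

    Because each item is handled and then dropped, only one parsed record should be alive at a time, and the file is never turned into one big string. If all records are needed at once, the loop can be wrapped in Array streamContents: as in the snippets above.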