Tags: json, haskell, haskell-pipes

Subsampling a huge json array with Haskell


I have a huge JSON file that I would like to avoid loading entirely into memory. Its structure is pretty simple: it consists of one large array with arbitrary elements inside. I'd like to transform the array by randomly dropping most of the elements and then output the transformed JSON.

Haskell seems well suited to this problem with all the laziness, and I thought it would make a nice Haskell exercise (I'm not an expert, and I don't know much FP theory).

I've found pipes-aeson [1] which seems to be what I want, but after trying for a while, I have to admit I'm stuck. There are almost no examples, and while I can work with Pipes to downsample data, working with a Parser object seems more complicated. The option I've found (evalStateT) is strict and parses the whole thing, without letting me intervene.

Maybe lenses would be the solution to my problem, but they're very abstract, and I understand neither what they are nor how to use them.

Could someone more knowledgeable than I am provide a little guidance?

[1] https://hackage.haskell.org/package/pipes-aeson-0.4.1.3/docs/Pipes-Aeson.html#t:DecodingError


Solution

  • I believe you will not be able to reuse aeson for this. From the aeson Parser documentation:

    It can be useful to think of parsing as occurring in two phases:

    • Identification of the textual boundaries of a JSON value. This is always strict, so that an invalid JSON document can be rejected as soon as possible.
    • Conversion of a JSON value to a Haskell value. This may be either immediate (strict) or deferred (lazy); see below for details.

    The first bullet seems to imply (to me, at least) that the parser will not hand you anything until it has inspected enough of the input to know whether parsing succeeded or failed. In your case, that's almost certainly the entire input, because the array's closing bracket is at the very end. So this phase will put (some representation of) the entire array in memory at once.

    This property is true of most parser combinator libraries at the moment. You could consider looking into uu-parsinglib as an alternative; I believe it supports returning partial parses. There is a very readable paper describing its capabilities linked from its Hackage page.
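
    If you only need to drop elements, not decode them, one way around this is to skip the JSON library entirely and find the element boundaries yourself. The following is only a rough sketch under several assumptions: the input is well-formed JSON whose top-level value is an array, nothing is validated, and plain lazy String is used for brevity (a lazy ByteString would be the realistic choice for a huge file). The names elements, subsample and render are made up for this example and do not come from aeson or pipes-aeson. The sampling decisions are passed in as a list of Bool; here it is a fixed "keep every tenth" pattern so the sketch needs only base, but a random list (for instance from the random package) could be substituted.

```haskell
module Main where

import Data.Char (isSpace)
import Data.List (intercalate)

-- | Lazily split the text of a JSON array into the raw text of its
-- top-level elements. Only nesting depth and string/escape state are
-- tracked; elements are neither validated nor decoded.
elements :: String -> [String]
elements input =
  case dropWhile isSpace input of
    '[' : rest -> go rest
    _          -> []
  where
    -- Skip separators and whitespace, then cut out the next element.
    go s = case dropWhile isSpace s of
      []       -> []
      ']' : _  -> []
      ',' : s' -> go s'
      s'       -> let (el, rest) = scan (0 :: Int) s' in el : go rest

    -- Prepend a character to the element being built; the lazy
    -- pattern keeps the result productive on long inputs.
    cons c ~(el, rest) = (c : el, rest)

    -- Consume one element, stopping at a ',' or ']' at depth 0.
    scan :: Int -> String -> (String, String)
    scan _ [] = ([], [])
    scan d (c : cs)
      | c == '"' =
          let (str, rest)  = string cs
              (el, rest')  = scan d rest
          in ('"' : str ++ el, rest')
      | c `elem` "[{"            = cons c (scan (d + 1) cs)
      | c `elem` "]}" && d > 0   = cons c (scan (d - 1) cs)
      | c `elem` ",]" && d == 0  = ([], c : cs)
      | otherwise                = cons c (scan d cs)

    -- Consume a string body up to and including the closing quote,
    -- stepping over backslash escapes such as \" and \\.
    string :: String -> (String, String)
    string [] = ([], [])
    string ('\\' : c : cs) = let (s, r) = string cs in ('\\' : c : s, r)
    string ('"' : cs) = ("\"", cs)
    string (c : cs) = cons c (string cs)

-- | Keep the elements whose corresponding decision is True.
subsample :: [Bool] -> [a] -> [a]
subsample keep xs = [x | (True, x) <- zip keep xs]

-- | Reassemble the kept elements into a JSON array.
render :: [String] -> String
render els = "[" ++ intercalate "," els ++ "]"

-- Reads JSON from stdin and writes the thinned array to stdout.
-- The fixed pattern keeps every tenth element (assumption: replace
-- it with random decisions for real subsampling).
main :: IO ()
main = interact (render . subsample (cycle (True : replicate 9 False)) . elements)
```

    Because interact reads its input lazily and each step above produces output as it consumes input, this should run in roughly constant memory as long as no single element is huge. Dropped elements are still scanned character by character, but they are not retained.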