javascript, typescript, whatwg-streams-api

How to stream a nested structure in JavaScript/TypeScript?


I am implementing a REST API endpoint that returns lists of objects of different types. The response will have roughly the following shape:

{
    "type1": [
        { "prop1": "value1a", ... },
        { "prop1": "value1b", ... },
        ...
    ],
    "type2": [
        { "prop2": "value2a", ... },
        ...
    ],
    "type3": [
        { "prop3": "value3a", ... },
        ...
    ]
}

As the list of objects can get quite long, I would like to emit it as a stream, so that neither the server nor the client has to keep the whole list in memory and the client can start processing the data before all of it has arrived. The JSON object would be streamed in chunks like this:

{
    "type1": [
        { "prop1": "value1a", ... },
        { "prop1": "value1b", ... }
    ],
    "type2": [
        ...

So far, so good.

To make it easier for people to use my REST API, I want to offer a TypeScript library with methods for calling the various API endpoints. My problem is that JavaScript streams are made for flat structures, whereas the data I want to stream is nested.

How can I stream a nested structure in JavaScript/TypeScript?


Solution

  • After experimenting with this for a while, I have come up with two approaches, each with its own advantages and disadvantages.

    Approach 1: flattened entries stream

    Flatten the stream into a stream of entries (ReadableStream<["type1", Type1Object] | ["type2", Type2Object] | ["type3", Type3Object]>).

    Flattened entries streams are rather easy to create, as the data already arrives in the form of a flat stream (of bytes) from the REST API, so it just needs to be transformed into a stream of entries.
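
    As a rough illustration of that transformation, here is a minimal sketch. It assumes that some streaming JSON parser has already turned the bytes into one event per array item. The ParserEvent shape, the toEntries name and the placeholder object types are all hypothetical and not part of any particular library.

```typescript
// Placeholder object types; the real properties depend on the API.
type Type1Object = { prop1: string };
type Type2Object = { prop2: string };
type Type3Object = { prop3: string };

type Entry =
    | ["type1", Type1Object]
    | ["type2", Type2Object]
    | ["type3", Type3Object];

// Hypothetical parser output: one event per array item, with the path to
// that item, for example { path: ["type1", 0], value: { prop1: "value1a" } }.
type ParserEvent = { path: [string, number]; value: unknown };

// Maps parser events to [type, object] entries and drops unknown keys.
function toEntries(): TransformStream<ParserEvent, Entry> {
    return new TransformStream<ParserEvent, Entry>({
        transform(event, controller) {
            const key = event.path[0];
            if (key === "type1" || key === "type2" || key === "type3") {
                // The cast trusts the server response; validate the value
                // here if that trust is not acceptable.
                controller.enqueue([key, event.value] as Entry);
            }
        },
    });
}
```

    A parsed event stream could then be converted with parserEvents.pipeThrough(toEntries()), where parserEvents stands for whatever the chosen parser produces.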

    The main problem that I have experienced while trying to consume flattened streams is that their type neither guarantees nor even indicates the order in which the items arrive. Suppose I have a ReadableStream<["type1", Type1Object] | ["type2", Type2Object] | ["type3", Type3Object]> and want to process the "type1" objects first and the "type2" objects afterwards. The REST API itself might guarantee that all the "type1" objects arrive first, then all the "type2" objects and then the "type3" objects, and I might mention this in the documentation of my code. The ReadableStream type itself does not guarantee this order, however, so it would be bad style to rely on it (and the order of the REST API response might change). Instead, a consumer has to iterate over the whole stream to be sure of having seen all the objects of one type. This means that consumers are forced either to write their code so that it handles all object types in parallel, or to cache a lot of objects in memory before being able to process them.
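
    The first of these two options, handling all object types in parallel, could look roughly like the following sketch. The consumeEntries function, its handlers parameter and the placeholder object types are made up for illustration.

```typescript
// Placeholder object types; the real properties depend on the API.
type Type1Object = { prop1: string };
type Type2Object = { prop2: string };
type Type3Object = { prop3: string };

type Entry =
    | ["type1", Type1Object]
    | ["type2", Type2Object]
    | ["type3", Type3Object];

type Handlers = {
    type1: (obj: Type1Object) => void;
    type2: (obj: Type2Object) => void;
    type3: (obj: Type3Object) => void;
};

// Dispatches every entry to the handler for its type as soon as it arrives.
// Nothing is cached and no assumption is made about the order of the types.
async function consumeEntries(
    stream: ReadableStream<Entry>,
    handlers: Handlers,
): Promise<void> {
    const reader = stream.getReader();
    for (;;) {
        const result = await reader.read();
        if (result.done) return;
        const entry = result.value;
        switch (entry[0]) {
            case "type1":
                handlers.type1(entry[1]);
                break;
            case "type2":
                handlers.type2(entry[1]);
                break;
            case "type3":
                handlers.type3(entry[1]);
                break;
        }
    }
}
```

    The cost of this style is that every handler must be ready from the start, which is awkward when processing one type depends on having finished another.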

    Approach 2: nested streams

    Return a nested stream for each object type (ReadableStream<(ReadableStream<Type1Object> & { type: "type1" }) | (ReadableStream<Type2Object> & { type: "type2" }) | (ReadableStream<Type3Object> & { type: "type3" })>). The advantage of this approach is that it mirrors the underlying object structure much more closely and can be iterated over in the same way as the plain object could be. Individual sub-streams can be teed or piped to different destinations if needed.

    A common way to consume a nested stream would be through a nested iteration:

    for await (const subStream of parentStream) {
        for await (const chunk of subStream) {
            // Do something with chunk
        }
    }
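
    Note that for await over a ReadableStream is not available in every runtime. Where it is missing, a small helper based on getReader() can stand in for it. The following is a sketch under that assumption, and iterate is a hypothetical name.

```typescript
// Fallback for runtimes where ReadableStream is not async iterable.
async function* iterate<T>(
    stream: ReadableStream<T>,
): AsyncGenerator<T, void, undefined> {
    const reader = stream.getReader();
    let finished = false;
    try {
        for (;;) {
            const result = await reader.read();
            if (result.done) {
                finished = true;
                return;
            }
            yield result.value;
        }
    } finally {
        // Like the built-in iterator with default options, leaving the loop
        // early (break, return, throw) cancels the stream.
        if (!finished) await reader.cancel().catch(() => {});
        reader.releaseLock();
    }
}
```

    The nested iteration above then becomes for await (const subStream of iterate(parentStream)) with an inner for await (const chunk of iterate(subStream)).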
    

    Producing a stream that can be consumed in this way is not too complicated, as the chunks are consumed in the order in which they arrive from the REST API. However, there are many other ways in which such a stream can be consumed, and this is where the main challenge with this approach emerges: it is very difficult to implement correctly. For example, a nested stream needs to handle the following cases:

    • A consumer might first iterate over the whole parent stream and then start consuming some of the sub streams later. If the nested stream is not implemented right, there is a chance that the parent stream will get stuck if the sub streams are not consumed immediately.
    • Related to that, a consumer might put some of the sub streams aside and consume them later in a different order than how they have arrived. The nested stream needs to buffer chunks in order to support such cases.
    • A consumer might cancel a sub stream if they don’t require the rest of its chunks, either by using break in the iteration or by calling subStream.cancel(). The nested stream needs to handle this in such a way that the parent stream continues emitting the remaining sub streams and that those sub streams still emit their data.
    • The source stream might abort. The nested stream needs to make sure that the error is forwarded to the parent stream and to all active sub streams, so that the consumer is notified about it no matter which part of the nested stream they are currently consuming. If this is not implemented correctly (for example, if the error is forwarded only to the parent stream), there is a chance that the iteration will get stuck waiting for a chunk on a sub stream.
    • A consumer might want to tee the parent stream. This needs to tee all the sub streams as well.
    • With all the challenges above, the implementation should still be able to apply back pressure if neither the parent stream nor any of the sub streams are currently being consumed.
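
    To make some of these cases more concrete, here is a deliberately simplified sketch of a splitter. It is not the StreamSplitter implementation mentioned below. It assumes that entries of the same type arrive consecutively, it reads the source eagerly and buffers without limit (so it applies no back pressure), and it does not support teeing the parent. The splitByType name and all types are hypothetical.

```typescript
type SubStream<K extends string, V> = ReadableStream<V> & { type: K };

type Active<K extends string, V> = {
    type: K;
    open: boolean; // false once the consumer has cancelled the sub stream
    controller?: ReadableStreamDefaultController<V>;
};

function splitByType<K extends string, V>(
    source: ReadableStream<[K, V]>,
): ReadableStream<SubStream<K, V>> {
    const reader = source.getReader();
    let parentCancelled = false;

    return new ReadableStream<SubStream<K, V>>({
        start(parent) {
            let active: Active<K, V> | undefined;

            const closeActive = (): void => {
                if (active?.open) active.controller?.close();
                active = undefined;
            };

            const openSub = (type: K): Active<K, V> => {
                const state: Active<K, V> = { type, open: true };
                const sub = new ReadableStream<V>({
                    // start() runs synchronously inside the constructor.
                    start(c) { state.controller = c; },
                    // A cancelled sub stream just drops its remaining chunks.
                    cancel() { state.open = false; },
                });
                if (!parentCancelled) parent.enqueue(Object.assign(sub, { type }));
                return state;
            };

            // Eager pump: chunks wait in each stream's internal queue until
            // they are read, so sub streams can be consumed late or out of order.
            const pump = async (): Promise<void> => {
                try {
                    for (;;) {
                        const result = await reader.read();
                        if (result.done) break;
                        const [type, value] = result.value;
                        if (!active || active.type !== type) {
                            closeActive();
                            active = openSub(type);
                        }
                        if (active.open) active.controller?.enqueue(value);
                    }
                    closeActive();
                    if (!parentCancelled) parent.close();
                } catch (err) {
                    // Forward a source error to the active sub stream and the parent.
                    if (active?.open) active.controller?.error(err);
                    if (!parentCancelled) parent.error(err);
                }
            };
            void pump();
        },
        cancel(reason) {
            parentCancelled = true;
            return reader.cancel(reason);
        },
    });
}
```

    Even this reduced version needs separate bookkeeping for cancelled sub streams and a cancelled parent. Restoring back pressure on top of it is where most of the real complexity lies.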

    If an implementation manages to get all of these right, nested streams can be a useful (although unusual) way to represent this data. As an inspiration, you can have a look at my implementation of StreamSplitter, which converts a flattened stream to a nested stream (with the assumption that the chunks arrive in order).

    Consumers of a nested stream need to keep in mind that individual sub streams need to be discarded by calling subStream.cancel() if not used, otherwise their data will remain in memory.
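
    For a consumer that is only interested in one type, discarding the rest could look like the following sketch. The consumeOnly function and its parameters are hypothetical names chosen for illustration.

```typescript
type SubStream<K extends string, V> = ReadableStream<V> & { type: K };

// Passes the chunks of the wanted sub streams to a handler and cancels every
// other sub stream, so that its buffered chunks can be released.
async function consumeOnly<K extends string, V>(
    parent: ReadableStream<SubStream<K, V>>,
    wanted: K,
    handle: (chunk: V) => void,
): Promise<void> {
    const parentReader = parent.getReader();
    for (;;) {
        const next = await parentReader.read();
        if (next.done) return;
        const sub = next.value;
        if (sub.type !== wanted) {
            // Not needed: discard explicitly instead of just ignoring it.
            await sub.cancel();
            continue;
        }
        const reader = sub.getReader();
        for (;;) {
            const chunk = await reader.read();
            if (chunk.done) break;
            handle(chunk.value);
        }
    }
}
```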

    Conclusion

    From the experience that I’ve gathered so far, I think that each of the approaches has advantages and disadvantages in certain scenarios. If implemented correctly, nested streams are more versatile, since they can easily be converted to flattened streams if needed, whereas the conversion in the other direction is much harder. But they require a lot more thought and testing to implement. So my personal conclusion is that when time and resources allow, nested streams are the better option for the user, but otherwise, flattened streams also work and are much easier to implement.
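
    To illustrate how simple the conversion from nested to flattened is, here is a minimal sketch. The flatten name is hypothetical, and the sketch reads the sub streams strictly in the order in which the parent emits them.

```typescript
type SubStream<K extends string, V> = ReadableStream<V> & { type: K };

// Converts a nested stream into a flat stream of [type, chunk] entries.
// It is pull-based: the source is only read when the consumer asks for data.
function flatten<K extends string, V>(
    nested: ReadableStream<SubStream<K, V>>,
): ReadableStream<[K, V]> {
    const parentReader = nested.getReader();
    let current: { type: K; reader: ReadableStreamDefaultReader<V> } | undefined;

    return new ReadableStream<[K, V]>({
        async pull(controller) {
            for (;;) {
                if (!current) {
                    const next = await parentReader.read();
                    if (next.done) {
                        controller.close();
                        return;
                    }
                    current = { type: next.value.type, reader: next.value.getReader() };
                }
                const chunk = await current.reader.read();
                if (!chunk.done) {
                    controller.enqueue([current.type, chunk.value]);
                    return;
                }
                // Sub stream exhausted: move on to the next one.
                current = undefined;
            }
        },
        async cancel(reason) {
            await current?.reader.cancel(reason);
            await parentReader.cancel(reason);
        },
    });
}
```

    A read error on the parent or on a sub stream rejects pull(), which puts the flattened stream into the errored state as well.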