json parsing rust serde

Parse arbitrarily large JSON array in Rust


Let's say we have a JSON document where one of the properties is an arbitrarily large array of objects:

{
    "type": "FeatureCollection",
    "features":[
      {"type": "feature", ...},
      {"type": "feature", ...},
      ... many, many more objects ...
    ]
}

The document might even be sent over the network so the number of objects inside the array might be unknown in advance.

The document might be several gigabytes large.

How can I parse such a document in Rust (preferably with Serde) without loading it into memory first? I'm only interested in the objects in the array. The ‘parent’ object (if you will) can be ignored.


Solution

  • If your features array is reasonably close to the "top" of your JSON structure (i.e. only a single level down), you can reasonably do this with serde.

    The sad part is that the usual #[derive(Deserialize)] mechanism generally can't be used on the outer levels of your JSON structure: you typically need some kind of state to process your stream of features, but the derived deserializers are stateless. So you must implement DeserializeSeed by hand, twice.

    The first replaces what would be the #[derive(Deserialize)] on your outer struct, but calls next_value_seed instead of next_value on features. This is all boilerplate, and I'm still waiting for someone to add this to serde's derive macros:

    use serde::de::{DeserializeSeed, MapAccess, Visitor};
    
    struct FeatureCollectionStream<F>(F);
    impl<'de, F: FnMut(Feature)> DeserializeSeed<'de> for FeatureCollectionStream<F> {
        type Value = ();
    
        fn deserialize<D>(self, d: D) -> Result<Self::Value, D::Error>
        where
            D: serde::Deserializer<'de>,
        {
            return d.deserialize_struct("FeatureCollection", &["type", "features"], FCV(self.0));
            struct FCV<F>(F);
            impl<'de, F: FnMut(Feature)> Visitor<'de> for FCV<F> {
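                type Value = ();
    
                // `expecting` is a required `Visitor` method; serde uses it in error messages.
                fn expecting(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
                    f.write_str("a FeatureCollection object")
                }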
    
                fn visit_map<A: MapAccess<'de>>(mut self, mut map: A) -> Result<Self::Value, A::Error> {
                    while let Some(k) = map.next_key::<String>()? {
                        match k.as_str() {
                            "type" => {
                                map.next_value::<String>()?;
                            }
                            "features" => map.next_value_seed(FeatureStream(&mut self.0))?,
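                            // Reject unexpected keys; to skip them instead, read the
                            // value as `serde::de::IgnoredAny`.
                            other => {
                                return Err(serde::de::Error::unknown_field(
                                    other,
                                    &["type", "features"],
                                ))
                            }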
                        }
                    }
                    Ok(())
                }
            }
        }
    }
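
    One detail in the block above is easy to miss: `FeatureStream(&mut self.0)` hands over a mutable reference to the callback, not the callback itself. That type-checks because `&mut F` implements `FnMut` whenever `F` does. The following std-only sketch isolates just that borrowing pattern; `Inner` and `drive` are hypothetical stand-ins, not serde types.

```rust
/// Stand-in for `FeatureStream`: wraps some callable and feeds it values.
struct Inner<F>(F);

impl<F: FnMut(u32)> Inner<F> {
    fn run(mut self, items: &[u32]) {
        for &item in items {
            (self.0)(item);
        }
    }
}

/// Stand-in for the outer visitor: it owns the callback and only lends it out.
fn drive<F: FnMut(u32)>(mut callback: F) {
    // `&mut F` is itself `FnMut(u32)`, so `Inner` accepts the borrow ...
    Inner(&mut callback).run(&[1, 2]);
    // ... and `callback` is still owned here, so it can be lent again.
    Inner(&mut callback).run(&[3]);
}
```

    Calling `drive(|n| seen.push(n))` with a `Vec` named `seen` leaves it holding 1, 2, 3, which shows that state captured by the closure survives each hand-off.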
    

    The next deserializer is the one you actually want, the one that processes the sequence of Feature values as a stream:

    struct FeatureStream<F>(F);
    impl<'de, F: FnMut(Feature)> DeserializeSeed<'de> for FeatureStream<F> {
        type Value = ();
    
        fn deserialize<D>(self, d: D) -> Result<Self::Value, D::Error>
        where
            D: serde::Deserializer<'de>,
        {
            return d.deserialize_seq(FV(self.0));
            struct FV<F>(F);
            impl<'de, F: FnMut(Feature)> Visitor<'de> for FV<F> {
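                type Value = ();
    
                // `expecting` is a required `Visitor` method; serde uses it in error messages.
                fn expecting(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
                    f.write_str("an array of features")
                }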
    
                fn visit_seq<A: serde::de::SeqAccess<'de>>(
                    mut self,
                    mut seq: A,
                ) -> Result<Self::Value, A::Error> {
                    while let Some(f) = seq.next_element()? {
                        (self.0)(f)
                    }
                    Ok(())
                }
            }
        }
    }
    

    The cute part is that you can still use the derive magic for the inner structures.

    #[derive(Deserialize, Default)]
    #[serde(rename_all = "snake_case")]
    enum FeatureType {
        #[default]
        Feature,
    }
    
    #[derive(Deserialize, Default)]
    struct Feature {
        #[allow(unused)]
        r#type: FeatureType,
        // ...   
    }
    

    Using this concoction, where x is any std::io::Read (serde_json recommends wrapping it in a BufReader):

    FeatureCollectionStream(|f: Feature| todo!("Do something with each feature"))
        .deserialize(&mut serde_json::Deserializer::from_reader(x))?;
    

    Playground with left-out error handling

    (cf. another answer by me employing the same "trick")