Skip empty objects when deserializing array with serde


I need to deserialize a JSON array of a type, let's call it Foo. I have implemented this and it works well for most inputs, but I have noticed that the latest version of the data sometimes includes erroneous empty objects.

Prior to this change, each Foo could be deserialized to the following enum:

#[derive(Deserialize)]
#[serde(untagged)]
pub enum Foo<'s> {
    Error {
        // My current workaround is using Option<Cow<'s, str>>
        error: Cow<'s, str>,
    },
    Value {
        a: u32,
        b: i32,
        // etc.
    }
}

/// Foo is part of a larger struct Bar.
#[derive(Deserialize)]
pub struct Bar<'s> {
    foos: Vec<Foo<'s>>,
    // etc.
}

This struct may represent one of the following JSON values:

// Valid inputs
[]
[{"a": 34, "b": -23},{"a": 33, "b": -2},{"a": 37, "b": 1}]
[{"error":"Unable to connect to network"}]
[{"a": 34, "b": -23},{"error":"Timeout"},{"a": 37, "b": 1}]

// Possible input for latest versions of data 
[{},{},{},{},{},{},{"a": 34, "b": -23},{},{},{},{},{},{},{},{"error":"Timeout"},{},{},{},{},{},{}]

This does not happen very often, but it is enough to cause issues. Normally, the array should include 3 or fewer entries, but these extraneous empty objects break that convention. There is no meaningful information I can gain from parsing {}, and in the worst cases there can be hundreds of them in one array.

I do not want to error on parsing {}, as the array still contains other meaningful values, but I do not want to include {} in my parsed data either. Ideally I would also be able to use tinyvec::ArrayVec<[Foo<'s>; 3]> instead of a Vec<Foo<'s>> to save memory and reduce time spent allocating during parsing, but I am unable to because of this issue.

How can I skip {} JSON values when deserializing an array with serde in Rust?

I also put together a Rust Playground with some test cases to try different solutions.


Solution

  • The simplest, though not the most efficient, solution is to define an enum that captures both the Foo case and the empty case, deserialize into a vector of those, and then filter that vector to keep only the nonempty ones.

    #[derive(Deserialize, Debug)]
    #[serde(untagged)]
    pub enum FooDe<'s> {
        Nonempty(Foo<'s>),
        Empty {},
    }
    
    fn main() {
        let json = r#"[
            {},{},{},{},{},{},
            {"a": 34, "b": -23},
            {},{},{},{},{},{},{},
            {"error":"Timeout"},
            {},{},{},{},{},{}
        ]"#;
        let foo_des = serde_json::from_str::<Vec<FooDe>>(json).unwrap();
        let foos = foo_des
            .into_iter()
            .filter_map(|item| {
                use FooDe::*;
                match item {
                    Nonempty(foo) => Some(foo),
                    Empty {} => None,
                }
            })
            .collect();
        let bar = Bar { foos };
        println!("{:?}", bar);
    
        // Bar { foos: [Value { a: 34, b: -23 }, Error { error: "Timeout" }] }
    }
    

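    Conceptually this is simple, but the intermediate vector holds one element for every Empty case, only to discard them afterward. Instead, you can control exactly how deserialization is done by implementing Deserialize for Bar yourself. Because this replaces the derived implementation, remove Deserialize from the derive attribute on Bar.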

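    use serde::de::{Deserialize, Deserializer, SeqAccess, Visitor};
    use std::{fmt, marker::PhantomData};
    
    struct BarVisitor<'s> {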
        marker: PhantomData<fn() -> Bar<'s>>,
    }
    
    impl<'s> BarVisitor<'s> {
        fn new() -> Self {
            BarVisitor {
                marker: PhantomData,
            }
        }
    }
    
    // This is the trait that informs Serde how to deserialize Bar.
    impl<'de, 's: 'de> Deserialize<'de> for Bar<'s> {
        fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
        where
            D: Deserializer<'de>,
        {
            impl<'de, 's: 'de> Visitor<'de> for BarVisitor<'s> {
                // The type that our Visitor is going to produce.
                type Value = Bar<'s>;
    
                fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
                    formatter.write_str("a list of objects")
                }
    
                fn visit_seq<V>(self, mut access: V) -> Result<Self::Value, V::Error>
                where
                    V: SeqAccess<'de>,
                {
                    let mut foos = Vec::new();
    
                    while let Some(foo_de) = access.next_element::<FooDe>()? {
                        if let FooDe::Nonempty(foo) = foo_de {
                            foos.push(foo)
                        }
                    }
    
                    let bar = Bar { foos };
    
                    Ok(bar)
                }
            }
    
            // Instantiate our Visitor and ask the Deserializer to drive
            // it over the input data, resulting in an instance of Bar.
            deserializer.deserialize_seq(BarVisitor::new())
        }
    }
    
    fn main() {
        let json = r#"[
            {},{},{},{},{},{},
            {"a": 34, "b": -23},
            {},{},{},{},{},{},{},
            {"error":"Timeout"},
            {},{},{},{},{},{}
        ]"#;
        let bar = serde_json::from_str::<Bar>(json).unwrap();
        println!("{:?}", bar);
    
        // Bar { foos: [Value { a: 34, b: -23 }, Error { error: "Timeout" }] }
    }