Tags: c++, serialization, boost, deserialization

How to deserialize only parts of a file using Boost in C++?


I have a number of instances of the same class that I serialize using boost::archive::binary_oarchive. They are saved in a given order. I am only interested in loading one of them and I know its position. How do I retrieve (deserialize) that one object without having to deserialize almost everything?

More generally, what is the best way to retrieve only some objects from a file?

Right now, my code looks something like this:

std::ofstream saveFile("savefile.save", std::ios::binary); // binary archives need a binary-mode stream

boost::archive::binary_oarchive oa(saveFile);
oa << arrayOfObjects;
        
saveFile.close();

// Later...

std::ifstream loadFile("savefile.save", std::ios::binary); // binary archives need a binary-mode stream

boost::archive::binary_iarchive ia(loadFile);
ia >> arrayOfObjects;

auto oneSpecificObject = arrayOfObjects[i]; // I have to do this; not efficient

loadFile.close();

Thanks in advance and cheers,


Solution

  • It all depends on the exact type of arrayOfObjects, because that is the deciding factor in how things get serialized.

    If it is a true array, things might not even be too complicated, though it becomes pretty tricky again as soon as object tracking is involved. E.g.


    X x{"the answer is 42"};
    // std::vector arrayOfObject { &x, &x, &x, &x, &x, &x, &x, &x }; // OR:
    X* arrayOfObject[] = { &x, &x, &x, &x, &x, &x, &x, &x };
    
    {
        boost::archive::text_oarchive oa(std::cout);
        oa << arrayOfObject;
    }
    

    Prints

    22 serialization::archive 19 8 0 1 0
    0 16 the answer is 42 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    

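    Note that the string payload ("the answer is 42") appears only once; the remaining entries merely refer back to the object that was already written. Hence, naively reading back only the 6th element would lead to... unspecified results. So, my recommendation would be to just read the entire array and discard all the data you don't need.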
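    To make the tracking problem concrete, here is a minimal sketch using only the standard library. It is a toy model, not Boost's actual archive format: the names Payload, save_tracked and record_at are hypothetical, and the "new"/"ref" record layout is an assumption made purely for illustration. It shows that a slot read in isolation may contain nothing but a back-reference.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Toy model of pointer tracking; this is NOT Boost's real archive format.
struct Payload {
    std::string text;
};

// The first time a pointer is seen, its payload is written as
// "new <id> <text>"; every later occurrence of the same pointer is
// written only as a back-reference "ref <id>".
std::string save_tracked(const std::vector<const Payload*>& items) {
    std::ostringstream out;
    std::map<const Payload*, std::size_t> seen; // address -> id already written
    for (const Payload* p : items) {
        assert(p != nullptr); // toy model: null pointers are not handled
        auto it = seen.find(p);
        if (it == seen.end()) {
            std::size_t id = seen.size();
            seen.emplace(p, id);
            out << "new " << id << ' ' << p->text << '\n';
        } else {
            out << "ref " << it->second << '\n';
        }
    }
    return out.str();
}

// Returns the raw record stored at position `index`, i.e. what a reader
// that jumps straight to that slot would see. Empty if out of range.
std::string record_at(const std::string& archive, std::size_t index) {
    std::istringstream in(archive);
    std::string line;
    for (std::size_t i = 0; std::getline(in, line); ++i) {
        if (i == index) return line;
    }
    return {};
}
```

    With eight pointers to the same Payload, only slot 0 carries the text; every other slot, including the 6th, holds just a reference that is meaningless unless the earlier record has been read too.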

    Hacking It

    If we are going to be unsafe, assuming no complicating factors like those mentioned above and relying on implementation details (e.g. how a vector is actually serialized), you can write deserialization code to match and get the behaviour you wished for:


    #include <boost/archive/binary_iarchive.hpp>
    #include <boost/archive/binary_oarchive.hpp>
    #include <boost/archive/text_iarchive.hpp>
    #include <boost/archive/text_oarchive.hpp>
    #include <boost/archive/xml_iarchive.hpp>
    #include <boost/archive/xml_oarchive.hpp>
    #include <boost/serialization/string.hpp>
    #include <boost/serialization/vector.hpp>
    #include <boost/preprocessor.hpp>
    #include <cassert>
    #include <iostream>
    #include <sstream>
    
    #ifndef TYPE
    #define TYPE xml
    #endif
    using oarchive = boost::archive::BOOST_PP_CAT(TYPE, _oarchive);
    using iarchive = boost::archive::BOOST_PP_CAT(TYPE, _iarchive);
    
    struct X {
        std::string answer;
        void serialize(auto& ar, unsigned) { ar& BOOST_SERIALIZATION_NVP(answer); }
    };
    
    template <typename T, size_t TargetIndex>
    struct FakeVectorReader {
        T element;
    
        template <typename Ar> void serialize(Ar& ar, unsigned)
        {
            static_assert(typename Ar::is_loading{});
            using namespace boost::serialization;
    
            collection_size_type count;
            ar >> make_nvp("count", count);
    
            if (library_version_type(3) < ar.get_library_version()) {
                item_version_type item_version(0);
                ar >> make_nvp("item_version", item_version);
            }
    
            assert(count > TargetIndex);
    
            T v;
            for (size_t i = 0; i < count; ++i) {
                ar >> make_nvp("item", v);
                if (i == TargetIndex) {
                    element = std::move(v);
                    ar.reset_object_address(&element, &v); // a bit half-hearted, this
                }
            }
        }
    };
    
    int main()
    {
        std::vector const arrayOfObject{
            X{"zero"}, {"one"}, {"two"},   {"three"}, {"four"},
            {"five"},  {"six"}, {"seven"}, {"eight"}, {"nine"},
        };
    
        std::stringstream ss;
        {
            oarchive oa(ss);
            oa << BOOST_SERIALIZATION_NVP(arrayOfObject);
        }
    
        if (std::string("binary") != BOOST_PP_STRINGIZE(TYPE)) {
            std::cout << ss.str() << std::endl;
        }
    
        {
            iarchive ia(ss);
            FakeVectorReader<X, 6> hack;
            ia >> boost::serialization::make_nvp("arrayOfObject", hack);
    
            std::cout << "hack.element: " << hack.element.answer << "\n";
        }
    }
    

    Printing (when built with TYPE defined as text; the default above is xml)

    22 serialization::archive 19 0 0 10 0 0 0 4 zero 3 one 3 two 5 three 4 four 4 five 3 six 5 seven 5 eight 4 nine
    hack.element: six
    

    Don't Try This At Home

    I'm trusting you will use this knowledge wisely.

    • I delved deep into the implementation details.
    • I paid half-hearted lip service to object tracking there, knowing that it will break when objects are actually aliased within an archive.
    • I should caution that all of this will break if you so much as change the container type, or even upgrade to a newer version of the Boost library.
    • Keep in mind that all elements are still deserialized (this is so that, if more data follows the array, you can at least still read that).
    • It will leak if X is not safe (i.e. does not follow the Rule of 3/5/0).
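    The point about every element still being consumed can be sketched without Boost. The following is a minimal illustration under stated assumptions: the length-prefixed text layout and the names write_items and read_only are hypothetical and have nothing to do with Boost's real format. It only mirrors the shape of the reader above: read the count, walk all items in order, keep one.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Toy length-prefixed text format (hypothetical, not Boost's):
//   <count> then, per item, <length> <bytes>, all separated by one space.
void write_items(std::ostream& out, const std::vector<std::string>& items) {
    out << items.size() << ' ';
    for (const std::string& s : items) {
        out << s.size() << ' ' << s << ' ';
    }
}

// Reads every item in order but keeps only the one at `target`.
// Because all items are consumed, the stream ends up positioned right
// after the array, so whatever follows it can still be read.
std::string read_only(std::istream& in, std::size_t target) {
    std::size_t count = 0;
    in >> count;
    assert(count > target);

    std::string kept;
    std::string scratch;
    for (std::size_t i = 0; i < count; ++i) {
        std::size_t length = 0;
        in >> length;
        in.get(); // skip the single separator after the length
        scratch.resize(length);
        in.read(&scratch[0], static_cast<std::streamsize>(length));
        if (i == target) kept = std::move(scratch);
    }
    return kept;
}
```

    Writing the ten strings "zero" through "nine" followed by some trailing token, then calling read_only(stream, 6), yields "six" and leaves the trailing token readable. The cost is the same as in the hack above: nothing is skipped on disk, only discarded in memory.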

    I hope the answer helps to illustrate why you shouldn't do this, and perhaps how you could if you somehow cannot avoid it (say you need to deserialize a part of that multi-terabyte archive, but don't have access to the supercomputer anymore).