serialization, storage, protocol-buffers, thrift

Store a set of protobufs on disk


I am using protobuf as a serializer to format data on disk. I may have a large set of protobuf objects, say, millions of them. What is the best way to lay them out on disk? The protobuf objects will be read sequentially one by one, or accessed randomly through an external index.

I used to use a length(int) + protobuf_object + length(int) + ... format, but it fails if one of the protobufs happens to be dirty (corrupted). Also, if many of the protobuf objects are small, the per-record length may add some overhead.


Solution

  • If you only need sequential access, the easiest way to store multiple messages is to write the size of each message before the message itself, as recommended by the documentation: http://developers.google.com/protocol-buffers/docs/techniques#streaming

    For example, you can create a class 'MessagesFile' with the following member functions to open, read and write your messages:

    // The file is opened in append mode and wrapped in
    // a FileOutputStream and a CodedOutputStream.
    bool Open(const std::string& filename,
              int buffer_size = kDefaultBufferSize) {
    
        file_ = open(filename.c_str(),
                     O_WRONLY | O_APPEND | O_CREAT,            // open mode
                     S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);   // file permissions (rw-r--r--)
    
        if (file_ != -1) {
            file_ostream_ = new FileOutputStream(file_, buffer_size);
            ostream_ = new CodedOutputStream(file_ostream_);
            return true;
        } else {
            return false;
        }
    }
    
    // Appends a new message: a 4-byte little-endian size, then the payload.
    // ByteSizeLong() replaces the deprecated ByteSize() in protobuf 3.x.
    bool Serialize(const google::protobuf::Message& message) {
        ostream_->WriteLittleEndian32(
            static_cast<google::protobuf::uint32>(message.ByteSizeLong()));
        return message.SerializeToCodedStream(ostream_);
    }
    
    // Reads the next message using a FileInputStream
    // wrapped in a CodedInputStream (istream_).
    bool Next(google::protobuf::Message *msg) {
        google::protobuf::uint32 size;
        if (!istream_->ReadLittleEndian32(&size)) {
            return false;  // end of file, or a truncated size prefix
        }
        // Restrict parsing to this message's bytes, and always restore
        // the previous limit, even when parsing fails.
        CodedInputStream::Limit msgLimit = istream_->PushLimit(size);
        bool ok = msg->ParseFromCodedStream(istream_) &&
                  istream_->ConsumedEntireMessage();
        istream_->PopLimit(msgLimit);
        return ok;
    }
    

    Then, to write your messages, use:

    MessagesFile file;
    file.Open("your_file.dat");
    
    file.Serialize(your_message1);
    file.Serialize(your_message2);
    ...
    // close the file
    

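    To read all your messages (here Open must open the file for reading and wrap it in a FileInputStream and a CodedInputStream, rather than the append-mode output streams shown above):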

    MessagesFile reader;
    reader.Open("your_file.dat");
    
    MyMsg msg;
    while( reader.Next(&msg) ) {
        // use your message
    
    }
    ...
    // close the file
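
The question also mentions random access through an external index, which the sequential reader above does not cover. The following is only a minimal sketch of one way to do it, not part of the protobuf API: it assumes each serialized message is handled as an opaque byte string (for instance the result of `message.SerializeAsString()`), and the helper names `AppendRecord` and `ReadRecordAt`, as well as the 64 MiB size cap, are hypothetical choices made for illustration. The writer returns the byte offset of each record so the caller can store it in the index, and the reader rejects implausible sizes instead of trusting a possibly corrupted length prefix.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <limits>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Appends one record as [4-byte little-endian size][payload] and returns the
// offset at which the record starts, so the caller can store it in an index.
// Assumption: the stream's put position is already at the end of the data.
std::uint64_t AppendRecord(std::ostream& out, const std::string& payload) {
    assert(payload.size() <= std::numeric_limits<std::uint32_t>::max());
    const std::uint64_t offset = static_cast<std::uint64_t>(out.tellp());
    const std::uint32_t size = static_cast<std::uint32_t>(payload.size());
    const char header[4] = {
        static_cast<char>(size & 0xFF),
        static_cast<char>((size >> 8) & 0xFF),
        static_cast<char>((size >> 16) & 0xFF),
        static_cast<char>((size >> 24) & 0xFF),
    };
    out.write(header, 4);
    out.write(payload.data(), static_cast<std::streamsize>(payload.size()));
    return offset;
}

// Reads the record that starts at `offset`. Returns false if the size prefix
// or the payload is truncated, or if the size exceeds `max_size` (a
// hypothetical sanity cap that guards against a corrupted prefix).
bool ReadRecordAt(std::istream& in, std::uint64_t offset, std::string* payload,
                  std::uint32_t max_size = 64u * 1024u * 1024u) {
    in.clear();  // drop any EOF/fail state left by a previous read
    in.seekg(static_cast<std::streamoff>(offset));
    unsigned char header[4];
    if (!in.read(reinterpret_cast<char*>(header), 4)) {
        return false;
    }
    const std::uint32_t size =
        static_cast<std::uint32_t>(header[0]) |
        (static_cast<std::uint32_t>(header[1]) << 8) |
        (static_cast<std::uint32_t>(header[2]) << 16) |
        (static_cast<std::uint32_t>(header[3]) << 24);
    if (size > max_size) {
        return false;
    }
    payload->resize(size);
    if (size == 0) {
        return true;  // an empty payload is a valid record
    }
    return static_cast<bool>(in.read(&(*payload)[0], size));
}
```

With this layout, the external index is just the list of offsets returned by `AppendRecord`; looking up the k-th message is one seek plus one read, and the bytes returned by `ReadRecordAt` can then be handed to `ParseFromString`. The size cap only catches grossly wrong prefixes; detecting a damaged payload would need something extra, such as a per-record checksum.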