FlatBuffers: How to write giant file using FlatBuffers


I have a large data set, possibly 30 GB. It seems that I need to partition it into many smaller pieces so that each piece can be stored in its own FlatBuffer.

I have already read this post: FlatBuffers: How to write giant files

However, I'm still not sure how to do it. I have two questions below.

I have a schema like this.

table A {
  number: int;
}
table B {
  a: [A];
}

root_type B;

Suppose I have some objects a0, a1, a2, and a3, and I partition them into two FlatBuffers stored on disk. The first FlatBuffer contains a0 and a1. The second contains a2 and a3. If I need a2, how do I know which FlatBuffer contains it? Does the FlatBuffers API support this?

I create a0, a1, a2, a3, ... sequentially, and I want to start a new partition once the FlatBuffer grows beyond 10 MB. I know I can get the size of the FlatBuffer via int size = builder.GetSize(). However, since I create these objects sequentially, how do I know the size of the FlatBuffer without calling builder.Finish(orc)?

Thanks for your help.

Updated: I wrote some code like this:

flatbuffers::FlatBufferBuilder builder;
int num0 = 3;
int num1 = 1;
int num2 = 5;
int num3 = 7;
auto a0 = CreateA(builder, num0);
cout << "size of a0 = " << builder.GetSize() << endl;
auto a1 = CreateA(builder, num1);
cout << "size of a0 and a1 = " << builder.GetSize() << endl;
auto a2 = CreateA(builder, num2);
cout << "size a0, a1, and a2 = " << builder.GetSize() << endl;
auto a3 = CreateA(builder, num3);
cout << "size a0, a1, a2, and a3 = " << builder.GetSize() << endl;

std::vector<flatbuffers::Offset<A>> A_vector;
A_vector.push_back(a0);
A_vector.push_back(a1);
A_vector.push_back(a2);
A_vector.push_back(a3);
// Named a_vec rather than B so the variable does not hide the table type B.
auto a_vec = builder.CreateVector(A_vector);
auto orc = CreateB(builder, a_vec);
builder.Finish(orc);
cout << "size all = " << builder.GetSize() << endl;

// size of a0 = 14
// size of a0 and a1 = 30
// size a0, a1, and a2 = 40
// size a0, a1, a2, and a3 = 48
// size all = 80

Could you kindly explain how these sizes are calculated? Why is the size of a0 and a1 not twice the size of a0, that is, 14*2 = 28 instead of 30? The same question applies to a2 and a3. Finally, why is the total size 80?

Thanks again.


Solution

  • There is no support in FlatBuffers for organizing data across multiple FlatBuffers; you'd have to invent your own mechanism for indexing them. If the size of the objects doesn't vary too wildly, then storing the same number of objects in each FlatBuffer would be the simplest and most efficient approach.

    If it is more important that the FlatBuffers are a particular size, then, as you say, keep serializing objects until GetSize() reaches the size you want. After that you'd still need to serialize the vector holding all these object offsets, which takes 4 bytes * number of objects, and the root. When reading, you'd first need to scan all FlatBuffers for their vector sizes to be able to index into them.
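To make the "invent your own mechanism" part concrete, here is a minimal sketch of the bookkeeping in plain C++. It is not part of the FlatBuffers API: Location, LocateFixed, ChunkIndex, and EstimateFinishedSize are hypothetical names made up for illustration. The size estimate assumes 4 bytes per offset plus the vector's 4-byte length prefix, and it takes the cost of the root table as a parameter you would have to measure yourself; alignment padding is ignored, so treat the result as approximate.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: which buffer (file) holds an object, and the
// object's position inside that buffer's vector.
struct Location {
    std::size_t buffer;
    std::size_t index;
};

// Case 1: every buffer holds exactly per_buffer objects (the last one may
// hold fewer). No lookup table is needed, only a division.
Location LocateFixed(std::size_t global_index, std::size_t per_buffer) {
    return {global_index / per_buffer, global_index % per_buffer};
}

// Case 2: buffers are cut by size, so each holds a different number of
// objects. Keep a running total per buffer and binary-search it.
class ChunkIndex {
public:
    // Record that the next buffer written holds `count` objects. When
    // reading, `count` would come from each buffer's vector size.
    void AddBuffer(std::size_t count) {
        ends_.push_back(TotalObjects() + count);
    }

    std::size_t TotalObjects() const {
        return ends_.empty() ? 0 : ends_.back();
    }

    Location Locate(std::size_t global_index) const {
        assert(global_index < TotalObjects());
        // First buffer whose cumulative count exceeds global_index.
        auto it = std::upper_bound(ends_.begin(), ends_.end(), global_index);
        std::size_t buffer = static_cast<std::size_t>(it - ends_.begin());
        std::size_t start = (buffer == 0) ? 0 : ends_[buffer - 1];
        return {buffer, global_index - start};
    }

private:
    std::vector<std::size_t> ends_;  // ends_[k] = objects in buffers 0..k
};

// Rough projection of the finished size before calling Finish():
// bytes already in the builder, plus the offset vector (4-byte length
// prefix and 4 bytes per object), plus an assumed root-table overhead.
std::size_t EstimateFinishedSize(std::size_t size_so_far,
                                 std::size_t object_count,
                                 std::size_t root_overhead) {
    return size_so_far + 4 + 4 * object_count + root_overhead;
}
```

With the example from the question (a0 and a1 in the first buffer, a2 and a3 in the second), calling AddBuffer(2) twice and then Locate(2) gives buffer 1, index 0. While writing, you could compare EstimateFinishedSize(builder.GetSize(), n, overhead) against your 10 MB limit after each object and start a new buffer once it is exceeded.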