Most of our data consists of strings with possibly duplicated substrings (e.g. domains: "some.thing.com" and "thing.com"). We'd like to reuse the substrings to reduce file size and memory consumption with FlatBuffers, so I'm planning to use [string], since I can just reference existing substrings. For example, "thing.com" would be a string created with let substr_offset = builder.create_string("thing.com"), and "some.thing.com" would be stored as [builder.create_string("some."), substr_offset].
However, referencing seems to have a cost, so there is probably no benefit in referencing if the string is too short (shorter than the size of an offset). Is that correct? Is the offset type just usize? What are better alternatives for representing prefix/postfix strings with FlatBuffers?
PS. BTW, what is the cost of a string array instead of just a string? Is it just the cost of one more offset?
Both strings and vectors are addressed via a 32-bit offset to them, and are also prefixed with a 32-bit size field. So:
"some.thing.com" 14 chars + 1 terminator + 4 size bytes == 19.
Or:
"thing.com" 9 chars + 1 terminator + 4 size bytes == 14.
"some." 5 chars + 1 terminator + 4 size bytes == 10.
vector of 2 strings: 2x4 bytes of offsets + 4 size bytes == 12.
total: 36.
Of those 36, 14 are shared, leaving 22 bytes of unique data, which is larger than the original 19. So the shared string needs to be 13 bytes or larger for this technique to be worth it, assuming it is shared often.
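The arithmetic above can be reproduced with a small Rust sketch. It uses only the standard library and does not call the FlatBuffers API; the function names (string_cost, offset_vector_cost, split_unique_cost, break_even_suffix_len) are hypothetical helpers made up for illustration, and alignment padding is ignored, as in the numbers above.

```rust
/// Bytes for one stored string: 4-byte size prefix + characters + 1 terminator.
/// Assumption: alignment padding is ignored, matching the arithmetic above.
fn string_cost(chars: usize) -> usize {
    4 + chars + 1
}

/// Bytes for a vector of `n` string offsets: 4-byte size prefix + 4 bytes per offset.
fn offset_vector_cost(n: usize) -> usize {
    4 + 4 * n
}

/// Unique (non-shared) bytes when a string is stored as [prefix, shared suffix]:
/// the prefix string itself plus the 2-element offset vector.
fn split_unique_cost(prefix_chars: usize) -> usize {
    string_cost(prefix_chars) + offset_vector_cost(2)
}

/// Smallest shared-suffix length (in characters) for which the split form
/// uses fewer unique bytes than storing the whole string in one piece.
fn break_even_suffix_len(prefix_chars: usize) -> usize {
    (0..)
        .find(|&suffix| split_unique_cost(prefix_chars) < string_cost(prefix_chars + suffix))
        .unwrap()
}

fn main() {
    let whole = string_cost("some.thing.com".len());
    let shared = string_cost("thing.com".len());
    let unique = split_unique_cost("some.".len());
    println!("whole string: {} bytes", whole);
    println!("split form: {} bytes total, {} unique", shared + unique, unique);
    println!("break-even suffix length: {}", break_even_suffix_len("some.".len()));
}
```

Under these assumptions the break-even length does not depend on the prefix length: the split form always adds 12 bytes for the offset vector plus 5 bytes of overhead for the extra string, which is paid back only once the shared suffix is long enough.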
For details: https://google.github.io/flatbuffers/flatbuffers_internals.html