cassandra embedding bert-language-model

How to store BERT embeddings in Cassandra


I want to use Cassandra as a feature store for precomputed BERT embeddings. Each row would consist of roughly 800 floating-point values (e.g. -0.18294132). Should I store all 800 in one large string column or in 800 separate columns?

The read pattern is simple: on every read we want all the values in a row. I'm not sure which layout would be better for serialization speed.


Solution

  • Storing each value in a separate column would be quite inefficient: every value carries its own metadata (the write timestamp, for example), which adds significant overhead (at least 8 bytes per value). Storing the data as a string would not be very efficient either, and it adds complexity on the application side.

    I would suggest storing the data as a frozen list of integers/longs or doubles/floats, depending on your requirements. Something like:

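    create table ks.bert(
      rowid int primary key,
      -- float (32-bit) fits values such as -0.18294132;
      -- use double, int or bigint if your data requires it
      data frozen<list<float>>
    );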
    

    In this case, the whole list is effectively serialized as a single binary blob that occupies just one cell.
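
As a rough illustration of the read and write path for this layout, here is a minimal Python sketch. It assumes the DataStax `cassandra-driver` package, a node reachable at `127.0.0.1`, and the `ks.bert` table with a floating-point element type for `data`. The names `EmbeddingStore` and `connect` are hypothetical helpers made up for this example, not part of the driver.

```python
INSERT_CQL = "INSERT INTO ks.bert (rowid, data) VALUES (?, ?)"
SELECT_CQL = "SELECT data FROM ks.bert WHERE rowid = ?"


def connect(hosts=("127.0.0.1",)):
    """Open a driver session (assumes cassandra-driver is installed and a node is up)."""
    # Imported here so the rest of the sketch can be loaded without the driver.
    from cassandra.cluster import Cluster
    return Cluster(list(hosts)).connect()


class EmbeddingStore:
    """Hypothetical helper: one row per embedding, the whole vector in one list cell."""

    def __init__(self, session):
        self._session = session
        # Prepare once and reuse; the statements are parsed server-side only once.
        self._insert = session.prepare(INSERT_CQL)
        self._select = session.prepare(SELECT_CQL)

    def put(self, rowid, embedding):
        # The full vector is bound as a single list value.
        values = [float(x) for x in embedding]
        self._session.execute(self._insert, (rowid, values))

    def get(self, rowid):
        # A single query returns every value of the row at once.
        # row.data assumes the driver's default named-tuple row factory.
        row = self._session.execute(self._select, (rowid,)).one()
        if row is None or row.data is None:
            return None
        return list(row.data)
```

Usage would look like `store = EmbeddingStore(connect())`, then `store.put(1, vector)` and `store.get(1)`. Note that a CQL `float` is 32-bit, so values read back are rounded to single precision; `double` round-trips Python floats exactly at twice the storage size.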