I want to use Cassandra as a feature store for precomputed BERT embeddings.
Each row would consist of roughly 800 numeric values (e.g. -0.18294132). Should I store all 800 in one large string column or in 800 separate columns?
The read pattern is simple: on read we would want every value in a row. I'm not sure which would be better for serialization speed.
Storing each value in a separate column would be quite inefficient: every cell carries its own metadata (the write timestamp, for example), which adds significant overhead (at least 8 bytes per value). Storing the data as a string is not very efficient either, and it adds complexity on the application side.
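To put a rough number on that overhead, here is a back-of-the-envelope sketch. It assumes each value is stored as a 4-byte float (an assumption, not something stated in the question) and uses the 8 bytes per value only as a lower bound:

```python
# Rough estimate of per-cell metadata overhead for 800 separate columns.
NUM_VALUES = 800
BYTES_PER_VALUE = 4          # assumption: 32-bit floats
METADATA_BYTES_PER_CELL = 8  # lower bound: one write timestamp per cell

payload = NUM_VALUES * BYTES_PER_VALUE            # bytes of actual data
overhead = NUM_VALUES * METADATA_BYTES_PER_CELL   # bytes of metadata
ratio = overhead / payload

print(f"payload:  {payload} bytes")
print(f"overhead: {overhead} bytes ({ratio:.1f}x the payload)")
```

Under these assumptions the metadata alone is at least twice the size of the data it describes.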
I would suggest storing the data as a frozen list of integers/longs or floats/doubles, depending on your requirements. Something like:
create table ks.bert(
  rowid int primary key,
  -- use list<float> or list<double> instead if the values are fractional
  data frozen<list<int>>
);
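In this case, the whole list is effectively serialized as a single binary blob that occupies just one cell. The trade-off is that a frozen list can only be written and read as a whole, which matches the read-everything access pattern here.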
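On the application side this could look roughly like the following sketch. It assumes the DataStax Python driver (cassandra-driver) with its default row factory and the ks.bert table from above; the helper names save_embedding and load_embedding are made up for illustration, and the element type of the values has to match the column's list type.

```python
def save_embedding(session, rowid, values):
    """Write the whole list as one cell (hypothetical helper)."""
    session.execute(
        "INSERT INTO ks.bert (rowid, data) VALUES (%s, %s)",
        (rowid, list(values)),
    )


def load_embedding(session, rowid):
    """Read the whole list back, or None if the row does not exist."""
    row = session.execute(
        "SELECT data FROM ks.bert WHERE rowid = %s",
        (rowid,),
    ).one()
    return list(row.data) if row is not None else None


# Usage against a real cluster (the contact point is an assumption):
#   from cassandra.cluster import Cluster
#   session = Cluster(["127.0.0.1"]).connect()
#   save_embedding(session, 1, [3, 1, 4])
#   values = load_embedding(session, 1)
```

For a hot path a prepared statement would probably be the better choice; simple statements are used here only to keep the sketch short.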