java, cassandra, cql

Are CQL list values really limited to 65535 bytes?


This document lists a number of CQL limits for Cassandra 2.2. I'm particularly interested in the Collection limits for Set and List. If I've interpreted it correctly, the document states that values in Sets are limited to 65535 bytes.

As far as I know, this limit exists because set identity is implemented by storing each element as part of the composite column name of the storage engine's cell (similar to the clustering column value limit), and CQL restricts that name to 65535 bytes.

Consider a table with a Set column, such as

CREATE TABLE test.bounds (
    someid text,
    someorder text,
    words set<text>,
    PRIMARY KEY (someid, someorder)
)

with

PreparedStatement ps = session.prepare("INSERT INTO test.bounds (someid, someorder, words) VALUES (?, ?, ?)");
BoundStatement bs = ps.bind("id", "order", ImmutableSet.of(StringUtils.repeat('a', 66000)));
session.execute(bs);

This throws the expected exception:

Caused by: com.datastax.driver.core.exceptions.InvalidQueryException: The sum of all clustering columns is too long (66024 > 65535)
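A client-side check could catch an oversized set element before the server rejects it. The following is a minimal sketch, assuming the limit applies to the UTF-8 encoded length of each element. The class and method names are hypothetical and not part of the DataStax driver. The 66024 in the message above suggests the server counts more than the element alone, so passing this check does not guarantee that the insert is accepted.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collection;

// Hypothetical helper, not part of the DataStax driver.
final class SetElementSizeCheck {
    // Largest value an unsigned 16-bit length can hold.
    static final int MAX_UNSIGNED_SHORT = 65535;

    // Returns true if every element's UTF-8 encoding fits in 65535 bytes.
    // Assumption: the server may count extra bytes on top of the element,
    // so this is a necessary condition, not a sufficient one.
    static boolean allElementsFit(Collection<String> elements) {
        for (String element : elements) {
            if (element.getBytes(StandardCharsets.UTF_8).length > MAX_UNSIGNED_SHORT) {
                return false;
            }
        }
        return true;
    }

    // Builds a string of n copies of c, standing in for StringUtils.repeat.
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }
}
```

Calling `SetElementSizeCheck.allElementsFit(...)` on the set before `ps.bind(...)` would flag the 66000-character value used above.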

Now if I change the table to use a List instead of a Set

CREATE TABLE test.bounds (
    someid text,
    someorder text,
    words list<text>,
    PRIMARY KEY (someid, someorder)
)

and use

BoundStatement bs = ps.bind("id", "order", ImmutableList.of(StringUtils.repeat('a', 66000)));

I do not receive an exception. The document, however, states that List value sizes are also limited to 65535 bytes. Is the document incorrect or am I misinterpreting?

I assumed List values are implemented as simple column values in the underlying storage and the order is maintained through their timestamps.


Solution

  • As far as I understand it, the documentation is wrong here. That limitation was lifted in native protocol version 3 (introduced in C* 2.1). From the changes section for protocol 3 in the native protocol specification:

    • The serialization format for collection has changed (both the collection size and the length of each argument is now 4 bytes long). See Section 6.

    So as long as you use protocol version 3 or higher, the protocol lets a list hold up to 2^31-1 (2147483647) elements, each up to 2^31-1 bytes long.

    Edit: I just noticed your comment about set identity. That may be a limitation of the storage engine itself, so perhaps the documentation was left this way for that reason, but the protocol itself now supports larger collections. I will look into whether we can document that nuance.
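    The format change quoted above can be illustrated with a small sketch. It assumes, based on my reading of the native protocol specification, that before version 3 a collection was encoded with a 2-byte element count and 2-byte element lengths, and that from version 3 both are 4 bytes. The class and method names are hypothetical, and this is not the driver's actual codec.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical illustration of the two collection layouts; not driver code.
final class CollectionFormatSketch {
    private static final int MAX_UNSIGNED_SHORT = 65535;

    // Assumed pre-v3 layout: [2-byte count] then, per element, [2-byte length][bytes].
    static ByteBuffer serializeV2(List<String> values) {
        return serialize(values, 2);
    }

    // Assumed v3+ layout: [4-byte count] then, per element, [4-byte length][bytes].
    static ByteBuffer serializeV3(List<String> values) {
        return serialize(values, 4);
    }

    private static ByteBuffer serialize(List<String> values, int lengthWidth) {
        boolean narrow = lengthWidth == 2;
        if (narrow && values.size() > MAX_UNSIGNED_SHORT) {
            throw new IllegalArgumentException("too many elements for a 2-byte count");
        }
        byte[][] encoded = new byte[values.size()][];
        int total = lengthWidth; // room for the element count
        for (int i = 0; i < encoded.length; i++) {
            encoded[i] = values.get(i).getBytes(StandardCharsets.UTF_8);
            if (narrow && encoded[i].length > MAX_UNSIGNED_SHORT) {
                throw new IllegalArgumentException(
                    "element of " + encoded[i].length + " bytes does not fit in a 2-byte length");
            }
            total += lengthWidth + encoded[i].length;
        }
        // ByteBuffer is big-endian by default, matching network byte order.
        ByteBuffer out = ByteBuffer.allocate(total);
        writeLength(out, values.size(), narrow);
        for (byte[] element : encoded) {
            writeLength(out, element.length, narrow);
            out.put(element);
        }
        out.flip();
        return out;
    }

    private static void writeLength(ByteBuffer out, int value, boolean narrow) {
        if (narrow) {
            out.putShort((short) value); // stored as an unsigned 16-bit value
        } else {
            out.putInt(value);
        }
    }
}
```

    With a single 66000-character ASCII element, `serializeV2` throws because the length does not fit in two bytes, while `serializeV3` produces a buffer of 4 + 4 + 66000 bytes. That matches the List insert in the question succeeding under protocol version 3.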