
Dynamic schema changes in Cassandra


I have a large number of users (150-200 million). Each user has N (30-100) attributes. An attribute can be of type integer, text, or timestamp. The attributes are not known in advance, so I want to add them dynamically, on the fly.

Solution 1 - Add a new column by altering the table

    CREATE TABLE USER_PROFILE (
        UID uuid PRIMARY KEY,
        LAST_UPDATE_DATE TIMESTAMP,
        CREATION_DATE TIMESTAMP
    );

For each new attribute:

    ALTER TABLE USER_PROFILE ADD AGE INT;
    INSERT INTO USER_PROFILE (UID, LAST_UPDATE_DATE, CREATION_DATE, AGE)
    VALUES (01f63e8b-db53-44ef-924e-7a3ccfaeec28, '2021-01-12 07:34:19.121', '2021-01-12 07:34:19.121', 27);
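
Every additional attribute repeats this pattern, so each one costs another cluster-wide schema change before the first write can happen. For illustration (the CITY attribute below is a hypothetical example, not part of the question):

    -- Each new attribute needs its own ALTER before any write:
    ALTER TABLE USER_PROFILE ADD CITY TEXT;
    -- CITY and its value are hypothetical, for illustration only
    UPDATE USER_PROFILE
    SET CITY = 'Berlin', LAST_UPDATE_DATE = '2021-01-12 07:34:19.121'
    WHERE UID = 01f63e8b-db53-44ef-924e-7a3ccfaeec28;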

Solution 2 - Fixed schema:

    CREATE TABLE USER_PROFILE (
        UID uuid,
        ATTRIBUTE_NAME TEXT,
        ATTRIBUTE_VALUE_TEXT TEXT,
        ATTRIBUTE_VALUE_TIMESTAMP TIMESTAMP,
        ATTRIBUTE_VALUE_INT INT,
        LAST_UPDATE_DATE TIMESTAMP,
        CREATION_DATE TIMESTAMP,
        PRIMARY KEY (UID, ATTRIBUTE_NAME)
    );

For each new attribute:

    INSERT INTO USER_PROFILE (UID, ATTRIBUTE_NAME, ATTRIBUTE_VALUE_INT, LAST_UPDATE_DATE, CREATION_DATE)
    VALUES (01f63e8b-db53-44ef-924e-7a3ccfaeec28, 'age', 27, '2021-01-12 07:34:19.121', '2021-01-12 07:34:19.121');
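
With this layout, reading a user's whole profile back is a single-partition query, for example (reusing the sample UUID from above):

    SELECT ATTRIBUTE_NAME, ATTRIBUTE_VALUE_TEXT, ATTRIBUTE_VALUE_TIMESTAMP, ATTRIBUTE_VALUE_INT
    FROM USER_PROFILE
    WHERE UID = 01f63e8b-db53-44ef-924e-7a3ccfaeec28;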

Which solution is better in terms of performance?


Solution

  • I would personally go with the 2nd solution: having a column for each data type that is used, and using the attribute name as the last component of the primary key (see examples in my previous answers on that topic).

    The first solution has the following problems:

    • If you perform schema modifications from code, you need to coordinate these changes; otherwise you will get a schema disagreement that must be resolved by administrators restarting the nodes. And a coordinated change will either slow down data insertion or create a single point of failure.
    • Having many columns has a significant performance impact. For example, per this very good analysis by The Last Pickle, having 100 columns instead of 10 increases read latency by more than 10 times.
    • You can't change an attribute's type if you ever need to. In the solution with the attribute name as a clustering column, you can simply start writing the attribute's value into the column for the new type. If the attribute is a regular column, you can't do that, because Cassandra doesn't allow changing a column's type (and don't try to drop the column and add it back with the new type: you'll corrupt your existing data). So you would need to create a completely new column for that attribute. A sketch of the type change under the second schema follows this list.
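
    To make the last point concrete, here is a minimal CQL sketch of such a type change under the second schema; the attribute name 'age' and all values shown are illustrative assumptions, not taken from the question:

        -- 'age' was first written as an integer:
        INSERT INTO USER_PROFILE (UID, ATTRIBUTE_NAME, ATTRIBUTE_VALUE_INT, LAST_UPDATE_DATE, CREATION_DATE)
        VALUES (01f63e8b-db53-44ef-924e-7a3ccfaeec28, 'age', 27, '2021-01-12 07:34:19.121', '2021-01-12 07:34:19.121');

        -- Later the same attribute can start being stored as text, with no
        -- ALTER TABLE; clearing the old typed column keeps exactly one value
        -- column non-null, so readers know which type is current:
        UPDATE USER_PROFILE
        SET ATTRIBUTE_VALUE_TEXT = 'twenty-seven',
            ATTRIBUTE_VALUE_INT = null,
            LAST_UPDATE_DATE = '2021-06-01 10:00:00.000'
        WHERE UID = 01f63e8b-db53-44ef-924e-7a3ccfaeec28 AND ATTRIBUTE_NAME = 'age';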