database cluster-analysis vector-database

Create Clusters of Rows in a Database based on similarities

I am designing a system that takes student's modules and groups them together in classes based on their department and subjects chosen.

Let's for example take the following table:

Student ID	Department	Choice 1	Choice 2	Choice 3	Choice 4
1	Languages	English	Literature	Spanish	Italian
2	Languages	English	Literature	Spanish	Italian
3	Languages	English	Literature	Spanish	German
4	Languages	English	Literature	Spanish	French
5	Science	Math	Physics	Chemistry	Biology
6	Science	Math	Physics	Chemistry	Computing

Let's say that each class can have 2 students.

With the above example 3 classes need to be created (1,2), (3,4) and (5,6). (Note: students grouped together must be on the same department)

Can you give me hints how to design the database so it can allow for a number of groups/cluster to be created. The application gonna be really big with a lot of entries.

I thought of using a Vector Database for this but I am not sure if it is the best option

I would appreciate any helpful hints, links or books

Solution

As you say you will have a lot of data and you might need vector capabilities, then Cassandra would make sense. You could start now on the free tier of AstraDB which is a managed cassandra service.

The denormalized data-model would have 3 tables and look like this:

First table to read student information

CREATE TABLE IF NOT EXISTS student_by_ID (
    Student_ID UUID PRIMARY KEY,
    Department TEXT,
    Choices LIST<TEXT>
);

Note if the number of choice will always be 4, then create 4 columns (choice1, choice2...)

Second table would allow you to retrieve a classe per id. Adding student_ID as a clustering column allow you to retrieve easly all the students for a given class.

CREATE TABLE IF NOT EXISTS classe_by_ID (
    Class_ID UUID,
    Student_ID UUID,
    Department TEXT,
    ...
    PRIMARY KEY (Class_ID, Student_ID)
);

Then we could have a last table for departments

CREATE TABLE IF NOT EXISTS Department_by_ID (
    Department_Name TEXT,
    Class_ID UUID,
    PRIMARY KEY (Department_Name, Class_ID)
);

Astra has vector search capabilities (that will be in Cassandra 5.0) so you can add vector type attribute to your tables to perform similarity search on it. (check this video)