Search code examples
databasecluster-analysisvector-database

Create Clusters of Rows in a Database based on similarities


I am designing a system that takes student's modules and groups them together in classes based on their department and subjects chosen.

Let's for example take the following table:

Student ID Department Choice 1 Choice 2 Choice 3 Choice 4
1 Languages English Literature Spanish Italian
2 Languages English Literature Spanish Italian
3 Languages English Literature Spanish German
4 Languages English Literature Spanish French
5 Science Math Physics Chemistry Biology
6 Science Math Physics Chemistry Computing

Let's say that each class can have 2 students.

With the above example 3 classes need to be created (1,2), (3,4) and (5,6). (Note: students grouped together must be on the same department)

Can you give me hints how to design the database so it can allow for a number of groups/cluster to be created. The application gonna be really big with a lot of entries.

I thought of using a Vector Database for this but I am not sure if it is the best option

I would appreciate any helpful hints, links or books


Solution

  • As you say you will have a lot of data and you might need vector capabilities, then Cassandra would make sense. You could start now on the free tier of AstraDB which is a managed cassandra service.

    The denormalized data-model would have 3 tables and look like this:

    First table to read student information

    CREATE TABLE IF NOT EXISTS student_by_ID (
        Student_ID UUID PRIMARY KEY,
        Department TEXT,
        Choices LIST<TEXT>
    );
    

    Note if the number of choice will always be 4, then create 4 columns (choice1, choice2...)

    Second table would allow you to retrieve a classe per id. Adding student_ID as a clustering column allow you to retrieve easly all the students for a given class.

    CREATE TABLE IF NOT EXISTS classe_by_ID (
        Class_ID UUID,
        Student_ID UUID,
        Department TEXT,
        ...
        PRIMARY KEY (Class_ID, Student_ID)
    );
    

    Then we could have a last table for departments

    CREATE TABLE IF NOT EXISTS Department_by_ID (
        Department_Name TEXT,
        Class_ID UUID,
        PRIMARY KEY (Department_Name, Class_ID)
    );
    
    

    Astra has vector search capabilities (that will be in Cassandra 5.0) so you can add vector type attribute to your tables to perform similarity search on it. (check this video)