How to save quite big multidimensional data into a NoSQL database?


I have multidimensional data whose size is over 2 GB. It has row names, column names, and a counter matrix that is quite sparse (in short, an (m, n) matrix with many zeros, but I still need the row and column names). The data is for a deep learning classification model, and so far I have been keeping it on the file system.

A dataset can have a variable n (number of columns), and the columns can have different names, so it isn't structured data.

One of my bosses told me to build a NoSQL database for saving and loading the datasets. I tried MongoDB first: saving the multidimensional array into a single document ran into the BSON size limit (16 MB). Since my boss wants to fix (update) individual elements of a dataset, I can't use GridFS either. So I built a data model with two collections, something like:

{ "datasetName" : datasetName, "rowNum" : m, "colNum" : n, "colNames" : colNames, #for later use. }

{ "rowName" : rowName "colVector" :colVector # sparse array for 1 row "reference" : datasetName }

My boss told me to build a new data model, without giving specific instructions, because I was not using the benefits of NoSQL. He said it was too much like an RDBMS.

I've thought about using Cassandra, which has a larger size limit, but saving the data wouldn't be much different from my former MongoDB data model. So I am thinking of building a model like

       col 1    col 2    col 3
row 1      0        1        0
row 2      1        0        0

and then to store it in Cassandra as something like dataset1 : [0,1,2...], [1,2,3...], [0,0,0...], where each [ ] is a column holding an array of counter values for [row1, row2, ..., row m]. That doesn't look good either (a concrete sketch of what it would mean is below).
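
For illustration, the per-column idea would come out roughly like this (my own sketch; all names are placeholders):

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])          # placeholder contact point
    session = cluster.connect("ml_keyspace")  # placeholder, pre-created keyspace

    # One list per matrix column; each list holds the counters
    # for [row1, row2, ..., row m].
    session.execute("""
        CREATE TABLE IF NOT EXISTS matrix_cols (
          dataset  TEXT,
          col_name TEXT,
          counters LIST<INT>,
          PRIMARY KEY ((dataset), col_name)
        )
    """)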

How can I make a better data model that actually uses the benefits of NoSQL?


Solution

  • The right model is determined by how you would query (i.e. read) back the stored data.

    Modeling an (m × n) matrix in Apache Cassandra requires careful consideration of how you plan to query the data. Cassandra's data model is based on a distributed hash table, and it's optimized for queries on primary keys. Here's a general approach to modeling an (m × n) matrix:

    Schema

    You can create a table with the following schema (one example):

    CREATE TABLE matrix (
      row_id INT,
      col_id INT,
      value DOUBLE,
      <other_columns>,
      PRIMARY KEY ((row_id, col_id))
    );
    

    Here, row_id represents the row index, col_id represents the column index, and value represents the value at that position in the matrix.

    Inserting Data

    You can insert data into this table using the following CQL command:

    INSERT INTO matrix (row_id, col_id, value) VALUES (0, 0, 42.0);
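
    Since the matrix is sparse, you would normally insert only the nonzero cells. A minimal sketch with the DataStax Python driver (the contact point and keyspace name are assumptions):

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])          # assumed contact point
    session = cluster.connect("ml_keyspace")  # assumed, pre-created keyspace

    rows = {"row 1": [0.0, 1.0, 0.0], "row 2": [1.0, 0.0, 0.0]}
    insert = session.prepare(
        "INSERT INTO matrix (row_id, col_id, value) VALUES (?, ?, ?)"
    )
    for i, vector in enumerate(rows.values()):
        for j, v in enumerate(vector):
            if v != 0.0:                      # skip zeros: sparse storage
                session.execute(insert, (i, j, v))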
    

    Querying Data

    You can query a specific cell with:

    SELECT value FROM matrix WHERE row_id = 0 AND col_id = 0;
    

    Or an entire row with:

    SELECT col_id, value FROM matrix WHERE row_id = 0 ALLOW FILTERING;
    

    <<<-- Using ALLOW FILTERING is to be avoided but just showing here as an example for discussion sake only. This will have performance impact.

    Considerations

    • Size: If the matrix is very large, you'll need to consider the impact on storage and query performance.
    • Access Patterns: Design the schema based on how you plan to query the data. If you need to query entire columns, you might need a different approach.
    • Data Distribution: Depending on the size of the matrix and the distribution of the data, you might need to consider partitioning strategies to ensure that the data is evenly distributed across nodes.

    Alternative Approaches

    If you often need to query entire rows or columns, you might consider using a different data structure, such as storing entire rows or columns as blobs or using a list or set data type.
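
    For example, a non-frozen list column lets you read a whole row at once and still fix a single element in place, which matches the update requirement from the question. This is a sketch with made-up names; keep in mind that Cassandra collections are meant for modestly sized values, so very wide rows may still need the cell-per-row table above:

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])          # assumed contact point
    session = cluster.connect("ml_keyspace")  # assumed, pre-created keyspace

    session.execute("""
        CREATE TABLE IF NOT EXISTS matrix_rows (
          dataset  TEXT,
          row_name TEXT,
          col_vec  LIST<DOUBLE>,  -- non-frozen, so single elements can be updated
          PRIMARY KEY ((dataset), row_name)
        )
    """)
    session.execute(
        "INSERT INTO matrix_rows (dataset, row_name, col_vec) VALUES (%s, %s, %s)",
        ("dataset1", "row 1", [0.0, 1.0, 0.0]),
    )
    # CQL can update one list element by index:
    session.execute(
        "UPDATE matrix_rows SET col_vec[1] = %s WHERE dataset = %s AND row_name = %s",
        (5.0, "dataset1", "row 1"),
    )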

    Remember, Apache Cassandra is designed and optimized around your query patterns. Always consider how you plan to query the data when designing your schema. Here is a browser-based free tutorial on learning data modeling by examples that I'd recommend.