Create a table in SnappyData for a large data set


I have 33 million records that I want to insert into a SnappyData database. I have already tried creating a column table without setting any of its options. The problem is that Spark is loading the whole database into RAM. I want to set the column options so that I get faster statistical counts.
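
For reference, this is roughly what I tried; the table and column names are made up for illustration:

CREATE TABLE records (
    id BIGINT,
    category VARCHAR(64),
    amount DOUBLE
) USING column;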

According to the reference, this is how we create tables:

CREATE TABLE [IF NOT EXISTS] table_name {
    ( column-definition [ , column-definition  ] * )
    }
    USING row | column 
    OPTIONS (
    COLOCATE_WITH 'table-name',  // Default none
    PARTITION_BY 'column-name', // If not specified it will be a replicated table.
    BUCKETS  'num-partitions', // Default 128. Must be an integer.
    REDUNDANCY        'num-of-copies' , // Must be an integer
    EVICTION_BY 'LRUMEMSIZE integer-constant | LRUCOUNT integer-constant | LRUHEAPPERCENT',
    PERSISTENCE 'ASYNCHRONOUS | ASYNC | SYNCHRONOUS | SYNC | NONE',
    DISKSTORE 'DISKSTORE_NAME', //empty string maps to default diskstore
    OVERFLOW 'true | false', // specifies the action to be executed upon eviction event
    EXPIRE 'time_to_live_in_seconds',
    COLUMN_BATCH_SIZE 'column-batch-size-in-bytes', // Must be an integer. Only for column table.
    COLUMN_MAX_DELTA_ROWS 'number-of-rows-in-each-bucket', // Must be an integer. Only for column table.
    )
    [AS select_statement];

  1. Can anyone suggest values I can set for options like BUCKETS, COLUMN_MAX_DELTA_ROWS, etc., which can increase the performance of OLAP queries?

  2. Can I set these options for the table after creating it?

E.g., in SQL we can use ALTER to set extra options on a table, like below:

ALTER TABLE t ENGINE = InnoDB;

Edit: We are performing batch insertions of 2 million records each. While each batch is inserted from JSON into the table as DataFrames, it consumes a large amount of RAM. Our problem is the excessive RAM usage by SnappyData.


Solution

    1. For most use cases the defaults should work fine. Only if you have many cores should you try increasing the number of BUCKETS so that the CPUs are utilized properly; see the example at the end of this answer.

    Follow the guidelines for optimizing query latency.

    2. We don't support altering these options for now.

    If you want to reduce memory consumption, use a COLUMN table. You can also try reducing the heap size when starting the server. Please see the guidelines on estimating memory requirements.

    These parameters will be helpful: -heap-size='memory allocation', -critical-heap-percentage=90, -eviction-heap-percentage=81. An example of how they fit into the server configuration is sketched below.
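
    For example, on a machine with many cores the table could be created with more buckets and a tuned delta buffer. The schema and values below are illustrative only, not a recommendation for your data:

    CREATE TABLE records (
        id BIGINT,
        category VARCHAR(64),
        amount DOUBLE
    ) USING column
    OPTIONS (
        PARTITION_BY 'id',             -- partition instead of replicating
        BUCKETS '256',                 -- default is 128; raise only if many cores are available
        COLUMN_MAX_DELTA_ROWS '10000'  -- rows buffered per bucket before flushing to the column store
    );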
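
    As a minimal sketch, those flags might appear in an entry in conf/servers like this (the host name and heap size are placeholders; size the heap from your memory estimate):

    localhost -heap-size=8192m -critical-heap-percentage=90 -eviction-heap-percentage=81

    The server picks these settings up when the cluster is started.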