cassandra cassandra-stress cassandra-stress-tool

How does the cassandra-stress yaml file work?

I am looking at a yaml file for cassandra-stress:

# Keyspace name and create CQL
#
keyspace: stressexample
keyspace_definition: |
  CREATE KEYSPACE stressexample WITH replication = {'class': 'NetworkTopologyStrategy', 'AWS_VPC_US_WEST_2': '2'};
#
# Table name and create CQL
#
table: eventsrawtest
table_definition: |
  CREATE TABLE eventsrawtest (
        host text,
        bucket_time text,
        service text,
        time timestamp,
        metric double,
        state text,
        PRIMARY KEY ((host, bucket_time, service), time)
  ) WITH CLUSTERING ORDER BY (time DESC)
 
#
# Meta information for generating data
#
columnspec:
  - name: host
    size: fixed(32) #In chars, no. of chars of UUID
    population: uniform(1..600)  # We have about 600 hosts with equal events per host
  - name: bucket_time
    size: fixed(18)
    population: uniform(1..288) # 288 potential buckets
  - name: service
    size: uniform(10..100)
    population: uniform(1000..2000) # 1000 - 2000 metrics per host
  - name: time
    cluster: fixed(15) 
  - name: state
    size: fixed(4)
 
#
# Specs for insert queries
#
insert:
  partitions: fixed(1)      # 1 partition per batch
  batchtype: UNLOGGED       # use unlogged batches
  select: fixed(10)/10      # no chance of skipping a row when generating inserts
 
#
# Read queries to run against the schema
#
queries:
   pull-for-rollup:
      cql: select * from eventsrawtest where host = ? and service = ? and bucket_time = ?
      fields: samerow             # pick selection values from same row in partition
   get-a-value:
      cql: select * from eventsrawtest where host = ? and service = ? and bucket_time = ? and time = ?
      fields: samerow             # pick selection values from same row in partition

I found this file on the internet and I don't quite understand how it works.

First of all, I don't understand columnspec. For partition columns host, bucket_time, service, it says:

population: uniform(1..600)  # We have about 600 hosts with equal events per host
population: uniform(1..288) # 288 potential buckets
population: uniform(1000..2000) # 1000 - 2000 metrics per host

Does that mean that I will have at most 600*288*2000 partitions? Is that the total number of partitions I will have when running cassandra-stress? Meaning that when the stress test is done, the maximum number of partitions I will see will be 600*288*2000? And the maximum number of columns I will see if I do "select count(*) from table" will be 600*288*2000*15?

Next, I don't understand the insert part:

partitions: fixed(1)      # 1 partition per batch

Does this mean that only 1 partition will be updated with 1 insert operation?

select: fixed(10)/10      # no chance of skipping a row when generating inserts

What is this select? I don't understand how it works. At first my table is empty, how will it select and insert anything, if there's nothing in the table? Is my understanding correct that it picks 100% of data from each batch for insertion (since it says fixed(10)/10), and then inserts it?

Solution

The cassandra-stress YAML you posted contains 4 sections:

the schema of the keyspace: and table: to be stress-tested,
the columnspec: section contains the meta-information that defines how the synthetic data will be generated,
the insert: section defines how data will be written, and
the queries: section defines how data will be read.

For the partition key:

The host column will contain strings of a fixed size of 32 characters, with a population uniformly distributed over the range 1 to 600.
The bucket_time column will contain strings of a fixed size of 18 characters, with a population uniformly distributed over the range 1 to 288 ("buckets").
The service column will contain strings of 10 to 100 characters, with a population uniformly distributed over the range 1000 to 2000.

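The population setting defines the range of seed values from which a column's values are generated, so it caps the number of distinct values for that column. uniform(1..600) gives at most 600 distinct hosts, uniform(1..288) gives at most 288 distinct buckets, and uniform(1000..2000) gives at most 1001 distinct services (the integers 1000 through 2000 inclusive), rather than 1000 to 2000 services per host. An upper bound on the total number of partitions (Tp) is therefore:

    Tp = hosts x buckets x services
       = 600 x 288 x 1001
       = 172,972,800

A run only approaches this bound if it performs enough insert operations to cover every combination.

The table has time as a clustering key, and since each partition contains a fixed 15 rows (cluster: fixed(15) in the columnspec), the maximum number of rows in the table (not columns) is:

    max_rows = Tp x time_rows
             = 172,972,800 x 15
             = 2,594,592,000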
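
The arithmetic above can be checked with a short sketch. It assumes the partition count is simply the product of the number of distinct values in each partition-key column. The function name and the three candidate service counts are illustrative and not part of cassandra-stress: 2000 is the upper figure from the question, 1500 is the midpoint of the range, and 1001 is the number of integers in 1000..2000 inclusive.

```python
def partition_bounds(hosts, buckets, services, rows_per_partition):
    """Return (partitions, rows) for the given distinct-value counts.

    Hypothetical helper: it only multiplies the counts together.
    """
    partitions = hosts * buckets * services
    return partitions, partitions * rows_per_partition


# Candidate counts of distinct service values (illustrative labels).
candidates = {
    "upper figure from the question": 2000,
    "midpoint of 1000..2000": 1500,
    "integers in 1000..2000 inclusive": 2000 - 1000 + 1,
}

for label, services in candidates.items():
    partitions, rows = partition_bounds(600, 288, services, 15)
    print(f"{label}: {partitions:,} partitions, {rows:,} rows")
```

Whichever count is used, the result is an upper bound; a run observes fewer partitions if it does not execute enough inserts to cover every key combination.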

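For the insert: section, the specification partitions: fixed(1) means that each batch writes to exactly 1 partition. The specification select: fixed(10)/10 is a ratio: it evaluates to 10/10 = 1, so every row generated for the partition (15 here, per cluster: fixed(15) on the time column) is written and none are skipped. Nothing is read from the table at this point; the rows are "selected" from the synthetic data that cassandra-stress generates, which is why the table can start out empty.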
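
One way to picture the select setting is as a fraction of the rows generated for a partition. The sketch below is a simplified model under that assumption, not the tool's actual implementation: rows_written is a hypothetical helper, and the real tool samples the numerator from a distribution and may round differently.

```python
from fractions import Fraction


def rows_written(select_numerator, select_denominator, rows_in_partition):
    """Model select as a ratio applied to the rows generated for a partition.

    Assumption: the result is truncated to a whole number of rows.
    """
    ratio = Fraction(select_numerator, select_denominator)
    return int(ratio * rows_in_partition)


# select: fixed(10)/10 with cluster: fixed(15) -> ratio 1, nothing skipped
print(rows_written(10, 10, 15))

# A hypothetical select of fixed(1)/3 would write a third of the rows
print(rows_written(1, 3, 15))
```

Under this model, a ratio of 1 always writes the full set of generated rows, which matches the "no chance of skipping a row" comment in the YAML.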

For more information on population distributions and statistical functions, see the cassandra-stress document on the Apache Cassandra website. Cheers!