I am looking at a yaml file for cassandra-stress:
# Keyspace name and create CQL
#
keyspace: stressexample
keyspace_definition: |
  CREATE KEYSPACE stressexample WITH replication = {'class': 'NetworkTopologyStrategy', 'AWS_VPC_US_WEST_2': '2'};
#
# Table name and create CQL
#
table: eventsrawtest
table_definition: |
  CREATE TABLE eventsrawtest (
      host text,
      bucket_time text,
      service text,
      time timestamp,
      metric double,
      state text,
      PRIMARY KEY ((host, bucket_time, service), time)
  ) WITH CLUSTERING ORDER BY (time DESC)
#
# Meta information for generating data
#
columnspec:
  - name: host
    size: fixed(32) #In chars, no. of chars of UUID
    population: uniform(1..600) # We have about 600 hosts with equal events per host
  - name: bucket_time
    size: fixed(18)
    population: uniform(1..288) # 288 potential buckets
  - name: service
    size: uniform(10..100)
    population: uniform(1000..2000) # 1000 - 2000 metrics per host
  - name: time
    cluster: fixed(15)
  - name: state
    size: fixed(4)
#
# Specs for insert queries
#
insert:
  partitions: fixed(1) # 1 partition per batch
  batchtype: UNLOGGED # use unlogged batches
  select: fixed(10)/10 # no chance of skipping a row when generating inserts
#
# Read queries to run against the schema
#
queries:
  pull-for-rollup:
    cql: select * from eventsrawtest where host = ? and service = ? and bucket_time = ?
    fields: samerow # pick selection values from same row in partition
  get-a-value:
    cql: select * from eventsrawtest where host = ? and service = ? and bucket_time = ? and time = ?
    fields: samerow # pick selection values from same row in partition
I found this file on the internet and I don't quite understand how it works.
First of all, I don't understand columnspec. For the partition key columns host, bucket_time, and service, it says:
population: uniform(1..600) # We have about 600 hosts with equal events per host
population: uniform(1..288) # 288 potential buckets
population: uniform(1000..2000) # 1000 - 2000 metrics per host
Does that mean that I will have at most 600*288*2000 partitions? Is that the total number of partitions I will have when running cassandra-stress? Meaning that when the stress test is done, the maximum number of partitions I will see will be 600*288*2000? And the maximum number of columns I will see if I do "select count(*) from table" will be 600*288*2000*15?
Next I don't understand the insert part
partitions: fixed(1) # 1 partition per batch
Does this mean that only 1 partition will be updated with 1 insert operation?
select: fixed(10)/10 # no chance of skipping a row when generating inserts
What is this select? I don't understand how it works. At first my table is empty, how will it select and insert anything, if there's nothing in the table? Is my understanding correct that it picks 100% of data from each batch for insertion (since it says fixed(10)/10), and then inserts it?
The cassandra-stress YAML you posted contains 4 sections:
- the keyspace: and table: to be stress-tested,
- the columnspec: section contains the meta-information that defines how the synthetic data will be generated,
- the insert: section defines how data will be written, and
- the queries: section defines how data will be read.
For the partition key:
- the host column will contain a fixed size of 32 characters with a population uniformly distributed between 1 and 600 hosts,
- the bucket_time column will contain a fixed size of 18 characters with a population uniformly distributed between 1 and 288 "buckets", and
- the service column will contain 10 to 100 characters with 1000 to 2000 services.
Since the possible number of service values is uniformly distributed from 1000 to 2000, we can assume that the average number of services is 1500. This means that the total number of partitions (Tp) is calculated by:
Tp = hosts x buckets x services
   = 600 x 288 x 1500
The table has time as a clustering key, and since each partition contains a fixed size of 15 rows (according to the columnspec), the maximum number of rows in the table (not columns) is:
max_rows = Tp x time_rows
         = (600 x 288 x 1500) x 15
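If you want to see the actual figures, the two formulas above can be evaluated with a few lines of Python. This is only a sketch of the arithmetic in this answer: the constant and function names are made up for illustration, and the 1500 is the assumed average number of services, not something cassandra-stress reports.

```python
# Figures taken from the columnspec in the question.
HOSTS = 600              # population: uniform(1..600)
BUCKETS = 288            # population: uniform(1..288)
AVG_SERVICES = 1500      # assumed average of uniform(1000..2000)
ROWS_PER_PARTITION = 15  # cluster: fixed(15) on the time column


def total_partitions(hosts: int, buckets: int, services: int) -> int:
    """Tp = hosts x buckets x services."""
    return hosts * buckets * services


def max_rows(partitions: int, rows_per_partition: int) -> int:
    """max_rows = Tp x time_rows."""
    return partitions * rows_per_partition


tp = total_partitions(HOSTS, BUCKETS, AVG_SERVICES)
rows = max_rows(tp, ROWS_PER_PARTITION)

print(f"Tp       = {tp:,}")    # 259,200,000
print(f"max_rows = {rows:,}")  # 3,888,000,000
```

Swapping AVG_SERVICES for 2000 gives the upper bound from your question instead of the estimate based on the average.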
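For the "write" section, the specification partitions: fixed(1) means that each write operation will only ever insert into 1 partition. The specification select: fixed(10)/10 is a ratio: it is the fraction of a partition's generated rows that each insert picks. Nothing is read from the table here; the rows are "selected" from the synthetic data that cassandra-stress generates (the 15 possible time values in the columnspec). Since 10/10 is 100%, all of the rows generated for the partition will be written and none are skipped.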
For more information on population distributions and statistical functions, see the cassandra-stress documentation on the Apache Cassandra website. Cheers!