cassandra cassandra-2.0 cassandra-stress

What is the select distribution ratio under insert distributions in the cassandra-stress tool?

select distribution ratio: The ratio of rows each partition should insert as a proportion of the total possible rows for the partition (as defined by the clustering distribution columns). default FIXED(1)/1

Can someone explain what this means, and why it is called the select distribution ratio when it is under insert distributions?

http://www.datastax.com/dev/blog/improved-cassandra-2-1-stress-tool-benchmark-any-schema

Solution

In Cassandra, data is assigned to a given node by the partition key, and then stored on disk sorted by the clustering key within the partition.

The 'distribution ratio' allows you to define:

1) How many rows the stress tool will create in each partition,

2) How many rows the stress tool will read from each partition (they'll be ordered, so it's fairly fast to grab more than one)

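In the case of the default FIXED(1)/1, the ratio is always 1/1, so each partition gets every row that the clustering column distributions allow for it. If you choose one of the other distributions, or a divisor larger than 1, only a proportion of the possible rows is written, and you'll end up with a variable number of rows.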

Edit to explain multiple rows per partition:

For example, if you had a data model where you gathered weather information from different cities:

CREATE TABLE sensor_readings (
    station_id text,
    weather_time timestamp,
    temperature int,
    humidity int,
    PRIMARY KEY (station_id, weather_time)
);

In this case, you have multiple rows (one for each weather_time) in each partition (station_id). You can query for all sensor readings in a given station_id, or you can query for only one specific weather_time. The distribution ratio controls how many weather_times you have per station_id.
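To tie this back to the stress tool, here is a rough sketch of what a user profile for this table might look like. It is not taken from the blog post: the keyspace name (stress_example), the query names, and every distribution value are assumptions chosen for illustration. The point is only to show where the clustering distribution (cluster) and the select ratio sit relative to each other.

```yaml
# Hypothetical cassandra-stress profile for the sensor_readings table.
# Keyspace name, query names and all distribution values are illustrative assumptions.
keyspace: stress_example
keyspace_definition: |
  CREATE KEYSPACE stress_example WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

table: sensor_readings
table_definition: |
  CREATE TABLE sensor_readings (
    station_id text,
    weather_time timestamp,
    temperature int,
    humidity int,
    PRIMARY KEY (station_id, weather_time)
  );

columnspec:
  - name: station_id
    size: fixed(10)
    population: uniform(1..1000)   # up to 1000 distinct partition keys
  - name: weather_time
    cluster: fixed(100)            # 100 possible rows (weather_times) per partition

insert:
  partitions: fixed(1)             # each batch touches one partition
  select: fixed(1)/4               # ratio 1/4: each insert writes about a quarter of the possible rows
  batchtype: UNLOGGED

queries:
  all_readings:                    # every reading for one station
    cql: select * from sensor_readings where station_id = ?
    fields: samerow
  one_reading:                     # a single reading for one station
    cql: select * from sensor_readings where station_id = ? and weather_time = ?
    fields: samerow
```

Assuming the profile is saved as sensor.yaml (a made-up file name), it could be run with something like: cassandra-stress user profile=sensor.yaml ops\(insert=1,all_readings=1\). With select set to fixed(1)/1 instead, each insert would write all 100 possible rows for the chosen station_id.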