I'm a newer with no-sql and my background is based on sql db (mysql). Last months I started to work with big data and I choose cassandra as no-sql db.
This is my dev environment:
My input is a daily csv files with many columns and I've to import just some of these columns. The structure of the csv file user_id => text
What is the condition for importing a csv row? the value of columns user_id + col_A + col_B + col_C must to be unique.
Then I thought to create a table with as many primary key as columns
CREATE TABLE unique_value (
user_id text,
col_A int,
col_B int,
col_C int,
PRIMARY KEY (user_id, col_A, col_B, col_C)
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
index_interval=128 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='99.0PERCENTILE' AND
memtable_flush_period_in_ms=0 AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'LZ4Compressor'};
During the import will be insert just unique values of the combination "user_id + col_A + col_B + col_C" of the csv file and that is what I want.
After the import I need to query the table in order to get the total number of the unique user_id (not duplicated) grouped by the values of col_B. In sql the query should be
SELECT COUNT(b.user_id), b.col_B(
(SELECT COUNT(user_id) AS is_user_exclusive, user_id, col_B FROM unique_value
GROUP BY col_B
HAVING is_user_exclusive < 2) AS b
GROUP BY b.col_B
but I still can't find the right cql query or probably the right data modeling.
Do you have any hints?
Thank you in advance
Take a look at the counters!!
http://www.datastax.com/documentation/cql/3.0/cql/cql_using/use_counter_t.html
You could create another table to do this counting
create table mycounts (count counter, user_id text, col_b int, PRIMARY KEY (user_id, col_b))
So, whenever you insert into unique_values table make a insert into the mycounts table. Then when you query you just do select * from mycounts. I hope this helps!!