Search code examples
cql3cassandra-2.0

cassandra - cql - count and group data imported from csv file


I'm a newer with no-sql and my background is based on sql db (mysql). Last months I started to work with big data and I choose cassandra as no-sql db.

This is my dev environment:

  • ubuntu 12.04 64 bit
  • cqlsh 4.1.1
  • Cassandra 2.0.6
  • CQL spec 3.1.1
  • Thrift protocol 19.39.0

My input is a daily csv files with many columns and I've to import just some of these columns. The structure of the csv file user_id => text

  • col_A => int
  • col_B => int
  • col_C => int
  • other_col => do not import
  • .....
  • .....
  • .....
  • other_col => do not import

What is the condition for importing a csv row? the value of columns user_id + col_A + col_B + col_C must to be unique.

Then I thought to create a table with as many primary key as columns

CREATE TABLE unique_value (
  user_id text,
  col_A int,
  col_B int,
  col_C int,
  PRIMARY KEY (user_id, col_A, col_B, col_C)
) WITH
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='99.0PERCENTILE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'LZ4Compressor'};

During the import will be insert just unique values of the combination "user_id + col_A + col_B + col_C" of the csv file and that is what I want.

After the import I need to query the table in order to get the total number of the unique user_id (not duplicated) grouped by the values of col_B. In sql the query should be

SELECT COUNT(b.user_id), b.col_B(
(SELECT COUNT(user_id) AS is_user_exclusive, user_id, col_B FROM unique_value
GROUP BY col_B
HAVING is_user_exclusive < 2) AS b
GROUP BY b.col_B

but I still can't find the right cql query or probably the right data modeling.

Do you have any hints?

Thank you in advance


Solution

  • Take a look at the counters!!

    http://www.datastax.com/documentation/cql/3.0/cql/cql_using/use_counter_t.html

    You could create another table to do this counting

    create table mycounts (count counter, user_id text, col_b int, PRIMARY KEY (user_id, col_b))

    So, whenever you insert into unique_values table make a insert into the mycounts table. Then when you query you just do select * from mycounts. I hope this helps!!