We're running DataStax Enterprise 4.0.1 and experimenting with running different M/R jobs against a CF in Cassandra. We've set up the column family as follows:
CREATE TABLE pageviews (
    website text,
    date text,
    created timestamp,
    browser_id text,
    ip text,
    referer text,
    user_agent text,
    PRIMARY KEY ((website, date), created, browser_id)
) WITH bloom_filter_fp_chance=0.001000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
index_interval=128 AND
read_repair_chance=1.000000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='NONE' AND
memtable_flush_period_in_ms=0 AND
compaction={'min_sstable_size': '52428800', 'class': 'SizeTieredCompactionStrategy'} AND
compression={'chunk_length_kb': '64', 'sstable_compression': 'LZ4Compressor'};
The benefit of Hive is that it handles the CQL3 "flattening", abstracting away Cassandra's underlying column/row storage mechanism. The downside appears to be that it doesn't use Cassandra's partition key or primary key to perform fast lookups, e.g.
SELECT COUNT(1) FROM pageviews WHERE website = "blah" AND date = "blah";
Running that MR job appears to perform a full table scan instead of pre-narrowing the keys it has to parse through. Is it possible to tell Hive not to perform a full table scan if there are obvious benefits to filtering based on partition/primary key?
Side note: When using Pig, it appears that it can and does use Cassandra's partition/primary key to perform fast lookups. The downside of Pig is that we have to do all of our filtering and flattening ourselves, which greatly increases the time it takes to create jobs.
The best bet is to use Pig with the cql:// URL scheme and CqlStorage(), which does the heavy lifting of flattening the Cassandra data for you, e.g.
grunt> pageviews = LOAD 'cql://ks/pageviews' USING CqlStorage();
grunt> describe pageviews;
pageviews: {website: chararray,date: chararray,created: long,browser_id: chararray,ip: chararray,referer: chararray,user_agent: chararray}
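Since the relation comes back already flattened, the COUNT from the question can be written against it directly. The following is only a minimal sketch: the aliases filtered, grouped and counted are made up for illustration, 'blah' is the placeholder value from the question, and it does not by itself guarantee that the filter is pushed down to Cassandra rather than applied on the Pig side.

```pig
-- Load the CQL3 table; CqlStorage returns one flattened tuple per CQL row.
pageviews = LOAD 'cql://ks/pageviews' USING CqlStorage();

-- Keep only the rows for one (website, date) partition.
filtered = FILTER pageviews BY website == 'blah' AND date == 'blah';

-- GROUP ... ALL collects everything into a single group so it can be counted.
grouped = GROUP filtered ALL;
counted = FOREACH grouped GENERATE COUNT(filtered);

DUMP counted;
```

Saved as a script or typed at the grunt> prompt line by line, this is the Pig counterpart of the Hive SELECT COUNT(1) query above.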