I have a wide-row column family which I'm trying to run a MapReduce job against. The CF is a time-ordered collection of events, where the column names are essentially timestamps. I need to run the MR job against a specific date range within the CF.
When I run the job with the widerow property set to false, the expected slice of columns is passed into the mapper class. But when I set widerow to true, the entire column family is processed, ignoring the slice predicate.
The problem is that I have to use wide-row support, as the number of columns in the slice can grow very large and consume all available memory if loaded in one go.
I've found this JIRA task which outlines the issue, but it has been closed off as "cannot reproduce" - https://issues.apache.org/jira/browse/CASSANDRA-4871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel
I'm running Cassandra 1.2.6, with cassandra-thrift 1.2.4 and hadoop-core 1.1.2 in my jar. The CF was created using CQL3.
It's worth noting that this occurs regardless of whether I use a SliceRange or specify the columns via setColumn_names() - it still processes all of the columns.
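For reference, the job configuration in question looks roughly like this. This is a sketch, not my exact code: the keyspace/CF names and the timestamp bounds are placeholders, and it assumes the standard org.apache.cassandra.hadoop.ConfigHelper API from the 1.2.x line.

```java
import java.nio.ByteBuffer;

import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.hadoop.mapreduce.Job;

// Inside the job-setup method (e.g. Tool.run()):
Job job = new Job(getConf(), "event-range-job");
ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
ConfigHelper.setInputInitialAddress(job.getConfiguration(), "localhost");
ConfigHelper.setInputPartitioner(job.getConfiguration(), "Murmur3Partitioner");

// The final boolean is the widerow flag: with false the predicate below is
// honoured, with true the predicate is ignored and every column is scanned.
ConfigHelper.setInputColumnFamily(job.getConfiguration(),
        "my_keyspace", "events", true);

// Slice predicate covering the desired date range; the column names are
// timestamps, encoded here as 8-byte longs (illustrative values).
ByteBuffer start  = ByteBuffer.allocate(8).putLong(0, 1372636800000L);
ByteBuffer finish = ByteBuffer.allocate(8).putLong(0, 1372723200000L);
SliceRange range = new SliceRange(start, finish, false, Integer.MAX_VALUE);
SlicePredicate predicate = new SlicePredicate().setSlice_range(range);
ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);
```

The same ignore-the-predicate behaviour shows up if the SlicePredicate is built with setColumn_names() instead of a SliceRange.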
Any help will be massively appreciated.
So it seems that this is by design. In the word_count example on GitHub, the following comment exists:
// this will cause the predicate to be ignored in favor of scanning everything as a wide row
ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY, true);
Urrrrgh. Fair enough then. It seems crazy that there is no way to limit the columns when using wide rows though.
UPDATE
Apparently the solution is to use the new org.apache.cassandra.hadoop.cql3 library. See the new example on GitHub for reference: https://github.com/apache/cassandra/blob/trunk/examples/hadoop_cql3_word_count/src/WordCount.java
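Paraphrasing the linked hadoop_cql3_word_count example, the cql3 job setup looks along these lines. Again a sketch: keyspace/table names, the page size, and the WHERE clause are placeholders, and exactly which clauses setInputWhereClauses() accepts (clustering-column or indexed-column restrictions) depends on the Cassandra version.

```java
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat;
import org.apache.hadoop.mapreduce.Job;

// Inside the job-setup method (e.g. Tool.run()):
Job job = new Job(getConf(), "event-range-cql3");
job.setInputFormatClass(CqlPagingInputFormat.class);
ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
ConfigHelper.setInputInitialAddress(job.getConfiguration(), "localhost");
ConfigHelper.setInputPartitioner(job.getConfiguration(), "Murmur3Partitioner");
ConfigHelper.setInputColumnFamily(job.getConfiguration(), "my_keyspace", "events");

// Page through wide partitions N CQL rows at a time, rather than loading
// the whole partition into memory (which is what forced widerow=true before).
CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "1000");

// Restrict the rows handed to the mapper to the desired date range
// (event_time is a hypothetical clustering column; bounds are illustrative).
CqlConfigHelper.setInputWhereClauses(job.getConfiguration(),
        "event_time >= 1372636800000 AND event_time < 1372723200000");
```

This gets both halves of the original requirement: the paging keeps memory bounded on very wide rows, and the WHERE clause limits the mapper input to the slice of interest.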