Tags: cassandra, apache-pig, cql3

CqlStorage generates wrong Pig schema


I'm loading some simple data from Cassandra into Pig using CqlStorage. The CqlStorage loader defines a schema based on the Cassandra schema, but it seems to be wrong.

If I do:

data = LOAD 'cql://bookdata/books' USING CqlStorage();
DESCRIBE data;

I get this:

data: {isbn: chararray,bookauthor: chararray,booktitle: chararray,publisher: chararray,yearofpublication: int}

However, if I DUMP data, I get results like these:

((isbn,0425093387),(bookauthor,Georgette Heyer),(booktitle,Death in the Stocks),(publisher,Berkley Pub Group),(yearofpublication,1986))

Clearly the results from Cassandra are key/value pairs, as would be expected. I don't know why the schema generated by CqlStorage() would be so different.

This is really causing me problems trying to access the column values. I tried a naive approach of FLATTENing each tuple, then trying to access the values that way:

flattened = FOREACH data GENERATE
  FLATTEN(isbn),
  FLATTEN(booktitle),
  ...
values = FOREACH flattened GENERATE
  $1 AS ISBN,
  $3 AS BookTitle,
  ...

As soon as I try to access field $5, Pig complains about the index being out of bounds. (Curiously, flattened thinks it has the same schema as the original data.)

Somehow, CqlStorage seems to be generating the wrong schema, and that schema persists to projections of the original collection. Is there any way to work around this?

(I'm using Cassandra 1.2.8 and Pig 0.11.1)


Solution

  • This error (CCE: BinSedesTuple cannot be cast to String) was resolved by applying the fix in https://issues.apache.org/jira/browse/CASSANDRA-5867.

    As Alex Lui mentioned in my ticket, build Cassandra from the 1.2 branch with the patch applied:

    git clone http://git-wip-us.apache.org/repos/asf/cassandra.git
    cd cassandra
    git checkout cassandra-1.2
    patch -p1 < 5867-bug-fix-filter-push-down-1.2-branch.txt
    ant
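
    Before running the patch step above, it can help to confirm that the patch applies cleanly to the checked-out branch. The following is only a sketch, assuming GNU patch (which provides the --dry-run option); the function name apply_patch_checked is a hypothetical helper and is not part of Cassandra or the ticket.

```shell
# Hypothetical helper: apply a patch only if a dry run succeeds.
# Assumes GNU patch; run it from the root of the source checkout.
apply_patch_checked() {
    patch_file=$1

    # --dry-run reports what would happen without modifying any files.
    if patch -p1 --dry-run < "$patch_file" > /dev/null; then
        patch -p1 < "$patch_file"
    else
        echo "Patch $patch_file does not apply cleanly; check the branch." >&2
        return 1
    fi
}
```

    With the patch file from the ticket saved in the cassandra directory, this could be used as `apply_patch_checked 5867-bug-fix-filter-push-down-1.2-branch.txt && ant`, so the build only runs when the patch was applied.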