I am trying to make a cqlsh request on Cassandra. The idea is analyse the average length from different 'gene' registers that had been annotated in a genome. After enter the data, I have this athaliana.tab table:
id | chr | comments | end | orf | sense | start | type
--------------------------------------+-----+-------------------+----------+-----+-------+----------+------
d2ab2520-6734-11e5-955c-234085c1edec | 1 | gene_id AT1G16340 | 5590338 | 0 | - | 5590241 | CDS
d4169c00-6734-11e5-955c-234085c1edec | 1 | gene_id AT1G16610 | 5676495 | . | - | 5676429 | exon
a8c792c0-6734-11e5-955c-234085c1edec | 1 | gene_id AT1G07485 | 2301889 | 0 | + | 2301665 | CDS
3bd5c0a0-6735-11e5-955c-234085c1edec | 1 | gene_id AT1G51980 | 19326916 | . | - | 19326733 | exon
263b5b60-6735-11e5-955c-234085c1edec | 1 | gene_id AT1G44990 | 17007808 | . | - | 17007542 | gene
67989a50-6735-11e5-955c-234085c1edec | 1 | gene_id AT1G63110 | 23405144 | . | + | 23404821 | UTR
26f7f4a0-6735-11e5-955c-234085c1edec | 1 | gene_id AT1G45180 | 17101207 | 0 | + | 17101109 | CDS
3743dc70-6735-11e5-955c-234085c1edec | 1 | gene_id AT1G50840 | 18841644 | 0 | + | 18840965 | CDS
e5099940-6734-11e5-955c-234085c1edec | 1 | gene_id AT1G20620 | 7145780 | . | + | 7145691 | exon
2ba30620-6735-11e5-955c-234085c1edec | 1 | gene_id AT1G48180 | 17793717 | . | - | 17792449 | gene
The idea is obtain the subtraction between end-start columns data and calculate the average. I have tried by this way:
SELECT avg(end-start) FROM athaliana.tab WHERE chr = '1' AND type = 'gene';
but it results in:
SyntaxException: <ErrorMessage code=2000 [Syntax error in CQL query] message="line 1:14 no viable alternative at input '-' (SELECT avg([end]-...)">
Some ideas? Any help will be really appreciated.
Cassandra is designed for maximum read and write efficiency and has no built-in support for functions applying to multiple columns such as plus
or minus
nor for aggregates, i.e., functions applying to multiple rows, such as avg
or count
.
Until recently, the workaround was to store the results of your functions in a separate column and perform aggregations in your own application code.
However, Cassandra 2.2 and 3.0 now offer User-Defined Functions (UDFs) and User-Defined Aggregates (UDAs). You can now define your own functions, then have Cassandra invoke them for you. See the CQL spec chapters on UDFs and UDAs or this blog post.
First, create the following functions and aggregates:
USE athaliana;
CREATE FUNCTION minus (x int, y int) RETURNS NULL ON NULL INPUT RETURNS int LANGUAGE java AS 'return x-y;';
CREATE FUNCTION avgState (state tuple<int,bigint>, val int) CALLED ON NULL INPUT RETURNS tuple<int,bigint> LANGUAGE java AS '
if (val != null) {
state.setInt(0, state.getInt(0)+1);
state.setLong(1, state.getLong(1)+val.intValue());
}
return state;
';
CREATE FUNCTION avgFinal (state tuple<int,bigint>) CALLED ON NULL INPUT RETURNS double LANGUAGE java AS '
double r = 0;
if (state.getInt(0) == 0) return null;
r = state.getLong(1);
r /= state.getInt(0);
return Double.valueOf(r);
';
CREATE AGGREGATE avg(int) SFUNC avgState STYPE tuple<int,bigint> FINALFUNC avgFinal INITCOND (0, 0);
Then your query could be rewritten this way:
SELECT avg(minus(end,start)) FROM athaliana.tab WHERE chr = '1' AND type = 'gene';
Note that you should only use aggregates on a single partition key; if this is not the case, a warning will be raised.