Search code examples
cassandracql3databasenosql

What is the best way to perform a subtraction on Cassandra by cqlsh request?


I am trying to make a cqlsh request on Cassandra. The idea is analyse the average length from different 'gene' registers that had been annotated in a genome. After enter the data, I have this athaliana.tab table:

 id                                   | chr | comments          | end      | orf | sense | start    | type
--------------------------------------+-----+-------------------+----------+-----+-------+----------+------
 d2ab2520-6734-11e5-955c-234085c1edec |   1 | gene_id AT1G16340 |  5590338 |   0 |     - |  5590241 |  CDS
 d4169c00-6734-11e5-955c-234085c1edec |   1 | gene_id AT1G16610 |  5676495 |   . |     - |  5676429 | exon
 a8c792c0-6734-11e5-955c-234085c1edec |   1 | gene_id AT1G07485 |  2301889 |   0 |     + |  2301665 |  CDS
 3bd5c0a0-6735-11e5-955c-234085c1edec |   1 | gene_id AT1G51980 | 19326916 |   . |     - | 19326733 | exon
 263b5b60-6735-11e5-955c-234085c1edec |   1 | gene_id AT1G44990 | 17007808 |   . |     - | 17007542 | gene
 67989a50-6735-11e5-955c-234085c1edec |   1 | gene_id AT1G63110 | 23405144 |   . |     + | 23404821 |  UTR
 26f7f4a0-6735-11e5-955c-234085c1edec |   1 | gene_id AT1G45180 | 17101207 |   0 |     + | 17101109 |  CDS
 3743dc70-6735-11e5-955c-234085c1edec |   1 | gene_id AT1G50840 | 18841644 |   0 |     + | 18840965 |  CDS
 e5099940-6734-11e5-955c-234085c1edec |   1 | gene_id AT1G20620 |  7145780 |   . |     + |  7145691 | exon
 2ba30620-6735-11e5-955c-234085c1edec |   1 | gene_id AT1G48180 | 17793717 |   . |     - | 17792449 | gene

The idea is obtain the subtraction between end-start columns data and calculate the average. I have tried by this way:

SELECT avg(end-start) FROM athaliana.tab WHERE chr = '1' AND type = 'gene';

but it results in:

SyntaxException: <ErrorMessage code=2000 [Syntax error in CQL query] message="line 1:14 no viable alternative at input '-' (SELECT avg([end]-...)">

Some ideas? Any help will be really appreciated.


Solution

  • Cassandra is designed for maximum read and write efficiency and has no built-in support for functions applying to multiple columns such as plus or minus nor for aggregates, i.e., functions applying to multiple rows, such as avg or count.

    Until recently, the workaround was to store the results of your functions in a separate column and perform aggregations in your own application code.

    However, Cassandra 2.2 and 3.0 now offer User-Defined Functions (UDFs) and User-Defined Aggregates (UDAs). You can now define your own functions, then have Cassandra invoke them for you. See the CQL spec chapters on UDFs and UDAs or this blog post.

    First, create the following functions and aggregates:

    USE athaliana;
    
    CREATE FUNCTION minus (x int, y int) RETURNS NULL ON NULL INPUT RETURNS int LANGUAGE java AS 'return x-y;';
    
    CREATE FUNCTION avgState (state tuple<int,bigint>, val int) CALLED ON NULL INPUT RETURNS tuple<int,bigint>  LANGUAGE java AS '
        if (val != null) {
          state.setInt(0, state.getInt(0)+1);
          state.setLong(1, state.getLong(1)+val.intValue());
        }
        return state;
    ';
    
    CREATE FUNCTION avgFinal (state tuple<int,bigint>) CALLED ON NULL INPUT RETURNS double LANGUAGE java AS '
        double r = 0;
        if (state.getInt(0) == 0) return null;
        r = state.getLong(1);
        r /= state.getInt(0);
        return Double.valueOf(r);
      ';
    
    CREATE AGGREGATE avg(int) SFUNC avgState STYPE tuple<int,bigint> FINALFUNC avgFinal INITCOND (0, 0);
    

    Then your query could be rewritten this way:

    SELECT avg(minus(end,start)) FROM athaliana.tab WHERE chr = '1' AND type = 'gene';
    

    Note that you should only use aggregates on a single partition key; if this is not the case, a warning will be raised.