Search code examples
mysqlsqlcountimpala

Divide columns values by total rows in impala


SELECT COUNT(DISTINCT cgi.sample_idSince Impala does not allow the SET operation, or subqueries in a select statement, I'm having a hard time figuring out how to divide column values by the total number of rows returned. My ultimate goal is to calculate minor allele frequency at each chr:start position.

My data is structured as follows:

| chr | start    | stop     | ref | allele1seq | allele2seq | sample_id | 
|  6  | 66720709 | 66720710 |  A  |      A     |     T      | 101-46-3  |
|  7  | 66720809 | 66720810 |  GG |      GA    |     GG     | 101-46-3  |

I'd like to do something similar to the query below:

WITH vars as
(SELECT cgi.chr, cgi.start, concat(cgi.chr, ':', CAST(cgi.start AS STRING)) as pos, cgi.ref, cgi.allele1seq, cgi.allele2seq,
    CASE 
        WHEN (cgi.allele1seq = cgi.ref AND cgi.allele2seq <> cgi.ref) THEN '1'  
        WHEN (cgi.allele1seq <> cgi.ref AND cgi.allele2seq = cgi.ref) THEN '1' 
        WHEN (cgi.allele1seq = cgi.ref AND cgi.allele2seq = cgi.ref) THEN '2' 
        ELSE '0' END as ma_count
    FROM comgen_variants as cgi)

SELECT vars.*, (CAST(vars.ma_count as INT)/
((SELECT COUNT(DISTINCT cgi.sample_id) from comgen_variants as cgi) * 2)) as maf
FROM vars

Where my desired output would like like:

| chr | start    | ref | allele1seq | allele2seq | ma_count | maf |  
|  6  | 66720709 |  A  |      A     |     T      |     1    | .05 | 
|  7  | 66720809 |  GG |      GG    |     GG     |     0    |  0  |

In addition to figuring out a way to divide by row count, I also need to group the results by chr and pos, and then count how many times each alternate allele (where allele1seq and allele2seq are not equal to ref) occurs instead of simply counting per row as I have above; but I haven't gotten that far due to the counting issue.

Thanks in advance for your help.


Solution

  • It looks like you could just calculate the total number of distinct sample_ids*2 in advance, and then use that for the subsequent query, since that value doesn't change per row. If the value did depend on the row, you might want to take a look at the analytic/window functions available to Impala.

    But, since it doesn't look like you need to, you could do something like the following:

    WITH total AS 
    (SELECT COUNT(DISTINCT sample_id) * 2 AS total FROM comgen_variants)
    
    SELECT cgi.*,
           (CASE 
              WHEN (cgi.allele1seq = cgi.ref AND cgi.allele2seq <> cgi.ref) THEN 1  
              WHEN (cgi.allele1seq <> cgi.ref AND cgi.allele2seq = cgi.ref) THEN 1 
              WHEN (cgi.allele1seq = cgi.ref AND cgi.allele2seq = cgi.ref) THEN 2
              ELSE 0 END) / total.total AS maf
    FROM comgen_variants AS cgi, total;
    

    I'm not sure that this is what the minor allele frequency is, though; it seems like you'd want to choose the second most common allele frequency for each locus?