Search code examples
mysqlsqlsql-serverimpala

SQL WHERE NOT EXISTS query not returning results


I'm having trouble constrsucting a query and getting it to run on impala. I created the following working query to join two tables:

SELECT *
FROM illuminavariant as vcf, ensembl_genes as ens 
WHERE vcf.filter = "PASS" 
AND vcf.qual > 100 
AND vcf.chromosome = ens.chromosome
AND vcf.position BETWEEN ens.start AND ens.stop

Now I'm trying to write a query that finds all variants WHERE vcf.filter = "PASS" and vcf.qual > 100, but do not have a match by chromosome and position.

I've tried this:

SELECT * 
FROM p7dev.illumina_test, p7dev.ensembl_test
WHERE NOT EXISTS(
  SELECT * 
  FROM p7dev.illumina_test as vcf, p7dev.ensembl_test as ens
  WHERE vcf.chromosome = ens.chromosome  
  AND vcf.position BETWEEN ens.start AND ens.stop  
 )

But that's not returning any results. I think a WITH clause might do the trick, but I'd really appreciate if anybody could help me wrap my head around the logic of how that would work. Thanks so much!


Solution

  • Since you're looking for variants that are not correlated to any ensemble, it seems strange that you would form a cross join of variants and ensembles from which to filter out rows. If that's really what you want, though, then this ought to do it:

    SELECT *
    FROM illuminavariant as vcf, ensembl_genes as ens 
    WHERE vcf.filter = "PASS" 
    AND vcf.qual > 100 
    AND (
        vcf.chromosome != ens.chromosome
        OR vcf.position < ens.start
        OR vcf.position > ens.stop
      )
    

    That just negates the condition correlating variant rows with ensemble rows.

    I suspect what you really want is more like this, though:

    SELECT vcf.*
    FROM
      illuminavariant as vcf
      LEFT JOIN ensembl_genes as ens 
        ON vcf.chromosome = ens.chromosome
        AND vcf.position BETWEEN ens.start AND ens.stop
    WHERE
      vcf.filter = "PASS" 
      AND vcf.qual > 100
      AND ens.chromosome IS NULL
    

    That performs the same join as your first query, but as a left join. The rows that actually represent matches are then filtered out by the ens.chromosome IS NULL condition. It returns only the columns of the variant table, as the whole point is to find variants that have no corresponding row in the ensemble table.