Search code examples
mysqlsqlleft-joinexistsimpala

SQL find records from table 1 that are not in table 2 OR in table 2 with condition


I see many examples on how to find records that are not in another table, but I'm having a lot of trouble finding records that are either not in table 2, or are in table two, but the freq column value is less than 10%.

I'm first joining a list of variants with ensembl gene names for BRCA1, BRCA2 and any genes that start with BRC, where a variant falls between the start and stop position.

From those results, I would like to check kaviar allele frequencies (k) and return results that either do not have an entry in the kaviar table, or results that are in the kaviar table with an alle_freq of < .10. The results from the first join need to be matched with kaviar by chr, pos, ref and alt.

I've tried:

SELECT DISTINCT *
FROM puzz p, ensembl ens, kaviar k
WHERE (ens.gene_name IN ('BRCA1', 'BRCA2')
   OR ens.gene_name LIKE 'RAS%')
AND p.chr = ens.chromosome
AND p.pos >= ens.start AND p.pos <= ens.stop
AND NOT EXISTS
 (SELECT k.chromosome, k.pos, k.ref, k.alt, k.alle_freq, k.alle_cnt
  FROM public_hg19.kaviar k
  WHERE  p.chr = k.chromosome
  AND p.pos = k.pos
  AND p.ref = k.ref
  AND p.alt = k.alt
  )
  AND p.pos = k.pos
  AND p.ref = k.ref
  AND p.alt = k.alt
  AND k.alle_freq < .10

And I've also tried:

WITH puzz AS (
SELECT * 
FROM puzz p
WHERE p.gt IS NOT NULL
)

SELECT DISTINCT t1.*, kav.*
FROM
(SELECT puzz.*, ens.*
FROM puzz, public_hg19.ensembl_genes AS ens
WHERE (ens.gene_name IN IN ('BRCA1', 'BRCA2')
   OR ens.gene_name LIKE 'RAS%')
AND puzz.chr = ens.chromosome
AND puzz.pos BETWEEN ens.start AND ens.stop
AND ens.chromosome NOT LIKE "H%") t1

LEFT JOIN  

public_hg19.kaviar as kav
ON kav.chromosome = t1.chr
AND kav.pos = t1.pos
AND kav.ref = t1.ref
AND kav.alt = t1.alt
AND (kav.alle_freq < .10 OR kav.alle_freq IS NULL)

SOLUTION: Thanks to @John Bollinger for providing the framework for the solution.

Because Impala does not index, the quickest solution involved creating a temporary table that narrows down the number of rows passed to string operations, as shown in the ens temp table.

WITH ens AS (
 SELECT DISTINCT chromosome as chr, start, stop, gene_name
   FROM public_hg19.ensembl_genes
  WHERE (gene_name IN ( 'BRCA1', 'BRCA2')
     OR gene_name LIKE 'RAS%')
    AND chromosome NOT LIKE "H%"
)
SELECT p.*, k.chromosome, k.pos, k.id, k.ref, k.alt,
       k.qual, (k.alle_freq * 100) as kav_freqPct, k.alle_cnt as kav_count
  FROM
(SELECT DISTINCT p.sample_id, p.chr, p.pos, p.id,
                 p.ref, p.alt, p.qual, p.filter,
                 ens.gene_name
  FROM ens, p7_ptb.itmi_102_puzzle p
  WHERE p.chr = ens.chr
  AND p.gt IS NOT NULL
  AND p.pos >= ens.start AND p.pos <= ens.stop
) AS p
LEFT JOIN public_hg19.kaviar k
      ON p.chr = k.chromosome
      AND p.pos = k.pos
      AND p.ref = k.ref
      AND p.alt = k.alt
WHERE COALESCE(k.alle_freq, 0.0) < .10

The following line, as pointed out by @Gordon Linoff could also be

WHERE (k.alle_freq IS NULL OR k.alle_freq < 0.10)

Both final clauses return the same results, but on impala, the coalesce function is somehow faster.


Solution

  • The two queries you present don't seem to match up. Table names differ, and some of the filter conditions simply don't correlate. In particular, from whence came the condition AND ens.chromosome NOT LIKE "H%" (with its incorrect quotes)?

    I do think your outer join approach is promising, but I don't understand why you need a CTE or an inline view.

    Also, "any gene that starts with 'BRC'" includes 'BRCA1' and 'BRCA2', so you don't need to test those separately. Removing redundant conditions may improve performance a little.

    Furthermore, if happens to be the case that the structure of your data will preclude duplicate rows anyway, then explicitly selecting DISTINCT rows cannot help you, but might harm you. (Nevertheless, I follow your lead by including it in my suggested query.) If there are many results then SELECT DISTINCT is expensive; especially so if you are selecting a lot of columns.

    This seems like it accurately expresses the query you describe:

    SELECT DISTINCT
      p.sample_id, p.chr, p.pos, p.ref,
      p.alt, p.gt, p.qual, p.filter
    FROM
      p7_ptb.itmi_102_puzzle p
      join public_hg19.ensembl_genes ens
        ON p.chr = ens.chromosome
      left join public_hg19.kaviar k
        ON p.chr = k.chromosome
          AND p.pos = k.pos
          AND p.ref = k.ref
          AND p.alt = k.alt
    WHERE
      ens.gene_name LIKE 'BRC%'
      AND ens.chromosome NOT LIKE 'H%'
      AND p.pos BETWEEN ens.start AND ens.stop
      AND COALESCE(k.alle_freq, 0.0) < .10
    

    If it's not fast enough for you then you'll want to examine your query plan to determine what the bottleneck is rather than trying to guess.