Search code examples
sql-servert-sqlgroup-bystring-concatenation

How to use SQL to concatenate two rows with grouping and CASE


I have some genomic data that is in SQL Server 2016 Express and it is currently shaped in a long format with a reference genome and test genomes split by a SubjectID, genes, and codons (e.g. a 3-tuple).

What I really need is to reshape my data into one which the tuples are concatenated together but ONLY when there is a mutation (as compared to the reference genome) in the tuple. This will be a more usable format for everyone.

My data looks like this

DECLARE @myTable TABLE 
(
     SubjectID VARCHAR(MAX), 
     country VARCHAR(MAX), 
     gene VARCHAR(MAX), 
     position INT, 
     ReferenceNucleotide VARCHAR(1), 
     TestNucleotide VARCHAR(1), 
     codon INT, 
     nucleotide_order INT 
)

INSERT INTO @myTable 
VALUES
('1-0002','India','gyrA',   65,'A','x', 92,1),
('1-0002','India','gyrA',   66,'T','x', 92,2),
('1-0002','India','gyrA',   67,'C','C', 92,3),
('1-0002','India','gyrA',   68,'T','T', 93,1),
('1-0002','India','gyrA',   69,'A','A', 93,2),
('1-0002','India','gyrA',   70,'C','C', 93,3),
('1-0002','India','gyrA',   71,'G','G', 94,1),
('1-0002','India','gyrA',   72,'A','A', 94,2),
('1-0002','India','gyrA',   73,'C','C', 94,3),
('1-0002','India','gyrA',   74,'A','A', 95,1),
('1-0002','India','gyrA',   75,'G','C', 95,2),
('1-0002','India','gyrA',   76,'C','C', 95,3),
('1-0002','India','gyrA',   77,'C','C', 96,1),
('1-0002','India','gyrA',   78,'T','T', 96,2),
('1-0002','India','gyrA',   79,'G','N', 96,3)

However, there are a couple of conditions

  • If all three nucleotides are the same for the reference and test genome I want a 'WT'
  • If there is any difference in nucleotides, I want the 3-tuple from the test genome (in the nucleotide order)
  • I need to group by SubjectID and gene because I have lots of Subjects & genes

And my result would look like

1-0002 India gyrA 92 xxC
1-0002 India gyrA 93 WT
1-0002 India gyrA 94 WT
1-0002 India gyrA 95 ACC
1-0002 India gyrA 96 CTN

I can identify where the codons are that need to have the 3-tuple but am struggling with how to concatentate them

DECLARE @myCodons TABLE (SubjectID varchar(max), country varchar(max), gene  varchar(max), codon int, WT int)

INSERT INTO @myCodons
    SELECT 
        SubjectID, country, gene, codon,
        SUM(CASE WHEN RefNucleotide=TestNucleotide THEN 0 ELSE 1 END) AS WT
    FROM 
        @myTable
    GROUP BY 
        SubjectID, country, gene, codon 

SELECT * 
FROM @myCodons 
ORDER BY codon

Solution

  • Start with something like:

    select T1.SubjectID, T1.country, T1.gene, T1.codon, 
        T1.RefGenome + T2.RefGenome + T3.RefGenome RefGenome,
        T1.TestGenome + T2.TestGenome + T3.TestGenome TestGenome
    from @myTable T1
        inner join @myTable T2 on T1.SubjectID = T2.SubjectID and T1.country = T2.country
            and T1.gene = T2.gene and T1.codon = T2.codon and T2.nucleotide_order = 2
        inner join @myTable T3 on T1.SubjectID = T3.SubjectID and T1.country = T3.country
            and T1.gene = T3.gene and T1.codon = T3.codon and T3.nucleotide_order = 3
    where T1.nucleotide_order = 1
    

    You can then build on this with case statements to work out whether to display the test genome or 'WT'