I have some genomic data that is in SQL Server 2016 Express and it is currently shaped in a long format with a reference genome and test genomes split by a SubjectID
, genes, and codons (e.g. a 3-tuple).
What I really need is to reshape my data into one which the tuples are concatenated together but ONLY when there is a mutation (as compared to the reference genome) in the tuple. This will be a more usable format for everyone.
My data looks like this
DECLARE @myTable TABLE
(
SubjectID VARCHAR(MAX),
country VARCHAR(MAX),
gene VARCHAR(MAX),
position INT,
ReferenceNucleotide VARCHAR(1),
TestNucleotide VARCHAR(1),
codon INT,
nucleotide_order INT
)
INSERT INTO @myTable
VALUES
('1-0002','India','gyrA', 65,'A','x', 92,1),
('1-0002','India','gyrA', 66,'T','x', 92,2),
('1-0002','India','gyrA', 67,'C','C', 92,3),
('1-0002','India','gyrA', 68,'T','T', 93,1),
('1-0002','India','gyrA', 69,'A','A', 93,2),
('1-0002','India','gyrA', 70,'C','C', 93,3),
('1-0002','India','gyrA', 71,'G','G', 94,1),
('1-0002','India','gyrA', 72,'A','A', 94,2),
('1-0002','India','gyrA', 73,'C','C', 94,3),
('1-0002','India','gyrA', 74,'A','A', 95,1),
('1-0002','India','gyrA', 75,'G','C', 95,2),
('1-0002','India','gyrA', 76,'C','C', 95,3),
('1-0002','India','gyrA', 77,'C','C', 96,1),
('1-0002','India','gyrA', 78,'T','T', 96,2),
('1-0002','India','gyrA', 79,'G','N', 96,3)
However, there are a couple of conditions
And my result would look like
1-0002 India gyrA 92 xxC
1-0002 India gyrA 93 WT
1-0002 India gyrA 94 WT
1-0002 India gyrA 95 ACC
1-0002 India gyrA 96 CTN
I can identify where the codons are that need to have the 3-tuple but am struggling with how to concatentate them
DECLARE @myCodons TABLE (SubjectID varchar(max), country varchar(max), gene varchar(max), codon int, WT int)
INSERT INTO @myCodons
SELECT
SubjectID, country, gene, codon,
SUM(CASE WHEN RefNucleotide=TestNucleotide THEN 0 ELSE 1 END) AS WT
FROM
@myTable
GROUP BY
SubjectID, country, gene, codon
SELECT *
FROM @myCodons
ORDER BY codon
Start with something like:
select T1.SubjectID, T1.country, T1.gene, T1.codon,
T1.RefGenome + T2.RefGenome + T3.RefGenome RefGenome,
T1.TestGenome + T2.TestGenome + T3.TestGenome TestGenome
from @myTable T1
inner join @myTable T2 on T1.SubjectID = T2.SubjectID and T1.country = T2.country
and T1.gene = T2.gene and T1.codon = T2.codon and T2.nucleotide_order = 2
inner join @myTable T3 on T1.SubjectID = T3.SubjectID and T1.country = T3.country
and T1.gene = T3.gene and T1.codon = T3.codon and T3.nucleotide_order = 3
where T1.nucleotide_order = 1
You can then build on this with case statements to work out whether to display the test genome or 'WT'