I am confused how to shred my JSON data into a table because is not formatted with names for the arrays
The actual JSON file is much bigger (19K lines) so I only pulled a small portion of it out (the first two of top level and a few from within those.
DECLARE @txt1 varchar(max) = '{ "Rv0005": { "p.Glu540Asp": { "annotations": [ { "type": "drug", "drug": "moxifloxacin", "literature": "10.1128/AAC.00825-17;10.1128/JCM.06860-11", "confers": "resistance" } ], "genome_positions": [ 6857, 6858, 6859 ] }, "p.Ala504Thr": { "annotations": [ { "type": "drug", "drug": "ciprofloxacin", "confers": "resistance" }, { "type": "drug", "drug": "fluoroquinolones", "confers": "resistance" }, { "type": "drug", "drug": "levofloxacin", "confers": "resistance" }, { "type": "drug", "drug": "moxifloxacin", "confers": "resistance" }, { "type": "drug", "drug": "ofloxacin", "confers": "resistance" } ], "genome_positions": [ 6749, 6750, 6751 ] }, "p.Ala504Val": { "annotations": [ { "type": "drug", "drug": "ciprofloxacin", "confers": "resistance" }, { "type": "drug", "drug": "fluoroquinolones", "confers": "resistance" }, { "type": "drug", "drug": "levofloxacin", "confers": "resistance" }, { "type": "drug", "drug": "moxifloxacin", "confers": "resistance" }, { "type": "drug", "drug": "ofloxacin", "confers": "resistance" } ], "genome_positions": [ 6749, 6750, 6751 ] } }, "Rv2043c": { "p.Thr100Ile": { "annotations": [ { "type": "drug", "drug": "pyrazinamide", "literature": "10.1128/JCM.01214-17", "confers": "resistance" } ], "genome_positions": [ 2288942, 2288943, 2288944 ] }, "p.Thr160Ala": { "annotations": [ { "type": "drug", "drug": "pyrazinamide", "literature": "10.1128/JCM.01214-17", "confers": "resistance" } ], "genome_positions": [ 2288762, 2288763, 2288764 ] }, "c.101_102insT": { "annotations": [ { "type": "drug", "drug": "pyrazinamide", "confers": "resistance" } ], "genome_positions": [ 2289140, 2289141 ] } } }'
SELECT * FROM OPENJSON(@txt1)
The top level is a gene and this is just the data from two genes (Rv0005 = gene 1, Rv2043c = gene 2). Each gene can have multiple mutations (e.g. Rv0005 has a mutation at p.Glu540Asp and p.Ala504Thr) and each of those mutations have some data associated with it (literature, resistance, genomic positions, etc.). I know I can parse portions of the JSON and JSON array out via
SELECT * FROM OPENJSON(@txt1)
SELECT * FROM OPENJSON(@txt1, '$.Rv0005."p.Glu540Asp".genome_positions')
But I don't know how to shred the whole thing out without knowing what the keys/values are. In particular there are 35 unique genes (the top of the JSON tree) and each of the mutations are named under them but are unique (e.g. p.Glu540Asp, etc).
Ultimately I'd either like to pull the data into multiple normalized tables but honestly one big table would be fine like this
CREATE TABLE #Muts (gene varchar(max), mutations varchar(max), annotation_type varchar(max), annotation_drug varchar(max), annotation_literature varchar(max), annotation_confers varchar(max), genome_positions int )
and the data for the first couple of values would look like this (notice that some mutations confer resistance to multiple drugs)
gene | mutations | annotation_type | annotation_drug | annotation_literature | annotation_confers | genome_positions |
---|---|---|---|---|---|---|
Rv0005 | p.Glu540Asp | drug | moxifloxacin | 10.1128/AAC.00825-17;10.1128/JCM.06860-11 | resistance | 6857 |
Rv0005 | p.Glu540Asp | drug | moxifloxacin | 10.1128/AAC.00825-17;10.1128/JCM.06860-11 | resistance | 6858 |
Rv0005 | p.Glu540Asp | drug | moxifloxacin | 10.1128/AAC.00825-17;10.1128/JCM.06860-11 | resistance | 6859 |
Rv0005 | p.Ala504Thr | drug | ciprofloxacin | 10.1128/AAC.00825-17;10.1128/JCM.06860-11 | resistance | 6849 |
Rv0005 | p.Ala504Thr | drug | fluoroquinolones | 10.1128/AAC.00825-17;10.1128/JCM.06860-11 | resistance | 6849 |
Rv0005 | p.Ala504Thr | drug | levofloxacin | 10.1128/AAC.00825-17;10.1128/JCM.06860-11 | resistance | 6849 |
Rv0005 | p.Ala504Thr | drug | moxifloxacin | 10.1128/AAC.00825-17;10.1128/JCM.06860-11 | resistance | 6849 |
Rv0005 | p.Ala504Thr | drug | ofloxacin | 10.1128/AAC.00825-17;10.1128/JCM.06860-11 | resistance | 6849 |
Rv0005 | p.Ala504Thr | drug | ciprofloxacin | 10.1128/AAC.00825-17;10.1128/JCM.06860-11 | resistance | 6850 |
Rv0005 | p.Ala504Thr | drug | fluoroquinolones | 10.1128/AAC.00825-17;10.1128/JCM.06860-11 | resistance | 6850 |
Rv0005 | p.Ala504Thr | drug | levofloxacin | 10.1128/AAC.00825-17;10.1128/JCM.06860-11 | resistance | 6850 |
Rv0005 | p.Ala504Thr | drug | moxifloxacin | 10.1128/AAC.00825-17;10.1128/JCM.06860-11 | resistance | 6850 |
Rv0005 | p.Ala504Thr | drug | ofloxacin | 10.1128/AAC.00825-17;10.1128/JCM.06860-11 | resistance | 6850 |
Rv0005 | p.Ala504Thr | drug | ciprofloxacin | 10.1128/AAC.00825-17;10.1128/JCM.06860-11 | resistance | 6851 |
Rv0005 | p.Ala504Thr | drug | fluoroquinolones | 10.1128/AAC.00825-17;10.1128/JCM.06860-11 | resistance | 6851 |
Rv0005 | p.Ala504Thr | drug | levofloxacin | 10.1128/AAC.00825-17;10.1128/JCM.06860-11 | resistance | 6851 |
Rv0005 | p.Ala504Thr | drug | moxifloxacin | 10.1128/AAC.00825-17;10.1128/JCM.06860-11 | resistance | 6851 |
Rv0005 | p.Ala504Thr | drug | ofloxacin | 10.1128/AAC.00825-17;10.1128/JCM.06860-11 | resistance | 6851 |
You have to use CROSS APPLY with OPENJSON when you want to "pivot" a JSON array to a tabular table.
The following query return the expected result:
SELECT a.[key] as gene, b.[key] as mutations, c.*, d.value as genome_positions
FROM OPENJSON(@txt1) a
CROSS APPLY OPENJSON(a.value) b
CROSS APPLY OPENJSON(b.value,'$.annotations')
WITH (
annotation_type nvarchar(100) '$.type'
, annotation_drug nvarchar(100) '$.drug'
, annotation_literature nvarchar(100) '$.literature'
, annotation_confers nvarchar(100) '$.confers'
) c
CROSS APPLY OPENJSON(b.value,'$.genome_positions') d
Result: