Parse JSON arrays into a table using OPENJSON

I am confused about how to shred my JSON data into a table because it is not formatted with names for the arrays.

The actual JSON file is much bigger (19K lines), so I pulled out only a small portion of it (the first two top-level entries and a few mutations from within those).

DECLARE @txt1 varchar(max) = '{ "Rv0005": { "p.Glu540Asp": { "annotations": [ { "type": "drug", "drug": "moxifloxacin", "literature": "10.1128/AAC.00825-17;10.1128/JCM.06860-11", "confers": "resistance" } ], "genome_positions": [ 6857, 6858, 6859 ] }, "p.Ala504Thr": { "annotations": [ { "type": "drug", "drug": "ciprofloxacin", "confers": "resistance" }, { "type": "drug", "drug": "fluoroquinolones", "confers": "resistance" }, { "type": "drug", "drug": "levofloxacin", "confers": "resistance" }, { "type": "drug", "drug": "moxifloxacin", "confers": "resistance" }, { "type": "drug", "drug": "ofloxacin", "confers": "resistance" } ], "genome_positions": [ 6749, 6750, 6751 ] }, "p.Ala504Val": { "annotations": [ { "type": "drug", "drug": "ciprofloxacin", "confers": "resistance" }, { "type": "drug", "drug": "fluoroquinolones", "confers": "resistance" }, { "type": "drug", "drug": "levofloxacin", "confers": "resistance" }, { "type": "drug", "drug": "moxifloxacin", "confers": "resistance" }, { "type": "drug", "drug": "ofloxacin", "confers": "resistance" } ], "genome_positions": [ 6749, 6750, 6751 ] } }, "Rv2043c": { "p.Thr100Ile": { "annotations": [ { "type": "drug", "drug": "pyrazinamide", "literature": "10.1128/JCM.01214-17", "confers": "resistance" } ], "genome_positions": [ 2288942, 2288943, 2288944 ] }, "p.Thr160Ala": { "annotations": [ { "type": "drug", "drug": "pyrazinamide", "literature": "10.1128/JCM.01214-17", "confers": "resistance" } ], "genome_positions": [ 2288762, 2288763, 2288764 ] }, "c.101_102insT": { "annotations": [ { "type": "drug", "drug": "pyrazinamide", "confers": "resistance" } ], "genome_positions": [ 2289140, 2289141 ] } } }'

SELECT * FROM OPENJSON(@txt1)

The top level is a gene, and this is just the data from two genes (Rv0005 = gene 1, Rv2043c = gene 2). Each gene can have multiple mutations (e.g. Rv0005 has mutations at p.Glu540Asp and p.Ala504Thr), and each of those mutations has some data associated with it (literature, resistance, genomic positions, etc.). I know I can parse portions of the JSON and the JSON arrays out via:

SELECT * FROM OPENJSON(@txt1) 
SELECT * FROM OPENJSON(@txt1, '$.Rv0005."p.Glu540Asp".genome_positions')

But I don't know how to shred the whole thing without knowing the keys in advance. In particular, there are 35 unique genes (the top level of the JSON tree), and the mutations under each gene have unique names (e.g. p.Glu540Asp).

Ultimately I'd like to pull the data into multiple normalized tables, but honestly one big table like this would be fine:

CREATE TABLE #Muts (gene varchar(max), mutations varchar(max), annotation_type varchar(max), annotation_drug varchar(max), annotation_literature varchar(max), annotation_confers  varchar(max), genome_positions int )

and the data for the first couple of values would look like this (notice that some mutations confer resistance to multiple drugs)

gene	mutations	annotation_type	annotation_drug	annotation_literature	annotation_confers	genome_positions
Rv0005	p.Glu540Asp	drug	moxifloxacin	10.1128/AAC.00825-17;10.1128/JCM.06860-11	resistance	6857
Rv0005	p.Glu540Asp	drug	moxifloxacin	10.1128/AAC.00825-17;10.1128/JCM.06860-11	resistance	6858
Rv0005	p.Glu540Asp	drug	moxifloxacin	10.1128/AAC.00825-17;10.1128/JCM.06860-11	resistance	6859
Rv0005	p.Ala504Thr	drug	ciprofloxacin	NULL	resistance	6749
Rv0005	p.Ala504Thr	drug	fluoroquinolones	NULL	resistance	6749
Rv0005	p.Ala504Thr	drug	levofloxacin	NULL	resistance	6749
Rv0005	p.Ala504Thr	drug	moxifloxacin	NULL	resistance	6749
Rv0005	p.Ala504Thr	drug	ofloxacin	NULL	resistance	6749
Rv0005	p.Ala504Thr	drug	ciprofloxacin	NULL	resistance	6750
Rv0005	p.Ala504Thr	drug	fluoroquinolones	NULL	resistance	6750
Rv0005	p.Ala504Thr	drug	levofloxacin	NULL	resistance	6750
Rv0005	p.Ala504Thr	drug	moxifloxacin	NULL	resistance	6750
Rv0005	p.Ala504Thr	drug	ofloxacin	NULL	resistance	6750
Rv0005	p.Ala504Thr	drug	ciprofloxacin	NULL	resistance	6751
Rv0005	p.Ala504Thr	drug	fluoroquinolones	NULL	resistance	6751
Rv0005	p.Ala504Thr	drug	levofloxacin	NULL	resistance	6751
Rv0005	p.Ala504Thr	drug	moxifloxacin	NULL	resistance	6751
Rv0005	p.Ala504Thr	drug	ofloxacin	NULL	resistance	6751

Solution

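Use CROSS APPLY with OPENJSON to turn nested JSON objects and arrays into rows. Called without a WITH clause, OPENJSON returns each property name in the [key] column and its content in the [value] column, so you do not need to know the gene or mutation names in advance.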
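
As a minimal sketch of why this works, the snippet below runs OPENJSON with its default schema on a made-up, shortened document. The variable @sample and the names geneA, geneB, mut1 and mut2 are illustrative assumptions, not part of the question's data. It assumes SQL Server 2016 or later with database compatibility level 130 or higher, which OPENJSON requires.

```sql
-- Hypothetical, shortened sample (not the data from the question)
DECLARE @sample nvarchar(max) = N'{
  "geneA": { "mut1": { "genome_positions": [ 1, 2 ] } },
  "geneB": { "mut2": { "genome_positions": [ 3 ] } }
}';

-- Default schema: one row per top-level property.
-- [key] holds the property name, [value] the nested JSON text,
-- and [type] is 5 for an object (4 would be an array).
SELECT [key], [value], [type]
FROM OPENJSON(@sample);

-- Feeding each [value] back into OPENJSON walks one level deeper,
-- so the unknown gene and mutation names come back as data.
SELECT g.[key] AS gene, m.[key] AS mutation
FROM OPENJSON(@sample) AS g
CROSS APPLY OPENJSON(g.[value]) AS m;
```

Each additional CROSS APPLY descends one more level, which is the pattern the full query relies on.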

The following query returns the expected result:

SELECT a.[key] AS gene, b.[key] AS mutations, c.*, d.value AS genome_positions
FROM OPENJSON(@txt1) a                                 -- one row per gene
CROSS APPLY OPENJSON(a.value) b                        -- one row per mutation of that gene
CROSS APPLY OPENJSON(b.value, '$.annotations')         -- one row per annotation of that mutation
WITH (
    annotation_type nvarchar(100) '$.type'
    , annotation_drug nvarchar(100) '$.drug'
    , annotation_literature nvarchar(100) '$.literature'
    , annotation_confers nvarchar(100) '$.confers'
) c
CROSS APPLY OPENJSON(b.value, '$.genome_positions') d  -- one row per genome position
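
Because the last two CROSS APPLY steps both hang off the same mutation, the query above pairs every annotation with every genome position. If the normalized layout mentioned in the question is preferred, one possible sketch is to shred annotations and positions separately. The temp table names #MutAnnotations and #MutPositions, the variable @j, and the shortened sample are assumptions for illustration; in practice the @txt1 variable from the question would take the place of @j.

```sql
-- Hypothetical shortened sample; substitute the real JSON document
DECLARE @j nvarchar(max) = N'{ "Rv0005": { "p.Glu540Asp": {
  "annotations": [ { "type": "drug", "drug": "moxifloxacin", "confers": "resistance" } ],
  "genome_positions": [ 6857, 6858, 6859 ] } } }';

-- One row per (gene, mutation, annotation)
SELECT g.[key] AS gene, m.[key] AS mutation,
       a.annotation_type, a.annotation_drug,
       a.annotation_literature, a.annotation_confers
INTO #MutAnnotations
FROM OPENJSON(@j) AS g
CROSS APPLY OPENJSON(g.[value]) AS m
CROSS APPLY OPENJSON(m.[value], '$.annotations')
WITH (
    annotation_type nvarchar(100) '$.type',
    annotation_drug nvarchar(100) '$.drug',
    annotation_literature nvarchar(400) '$.literature',
    annotation_confers nvarchar(100) '$.confers'
) AS a;

-- One row per (gene, mutation, genome position)
SELECT g.[key] AS gene, m.[key] AS mutation,
       CAST(p.[value] AS int) AS genome_position
INTO #MutPositions
FROM OPENJSON(@j) AS g
CROSS APPLY OPENJSON(g.[value]) AS m
CROSS APPLY OPENJSON(m.[value], '$.genome_positions') AS p;

SELECT * FROM #MutAnnotations;
SELECT * FROM #MutPositions;
```

Joining the two tables on gene and mutation reproduces the single wide result when it is needed.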
