I have a JSON file that is generated each day. Its structure can differ depending on the source data (each file can contain more or fewer attributes). For example: File 1 has the attributes productCode, productType, sellerName, sellerCountry. File 2 has all of the above plus sellerState. The next file could have more or fewer attributes than File 2.
How could I flatten this kind of JSON file into a SQL table dynamically?
File1:
```json
[
  {
    "productCode": "00001",
    "productType": "Food",
    "sellerDetails": [
      {
        "sellerName": "Cosco",
        "sellerCountry": "UK"
      }
    ]
  },
  {
    "productCode": "00002",
    "productType": "Clothing"
  }
]
```
File2:
```json
[
  {
    "productCode": "00001",
    "productType": "Food",
    "sellerDetails": [
      {
        "sellerName": "Cosco",
        "sellerCountry": "UK"
      }
    ]
  },
  {
    "productCode": "00002",
    "productType": "Clothing"
  },
  {
    "productCode": "00003",
    "productType": "Toy",
    "sellerDetails": [
      {
        "sellerName": "Cosco",
        "sellerCountry": "AUS",
        "sellerState": "VIC"
      }
    ]
  }
]
```
It is possible to do this dynamically using dynamic SQL generation, but in most cases that is not necessary.
In JSON files it is common to omit properties for individual entities when their value is `null`. So although the output looks dynamic, the systems that generate these files usually conform to a standard schema: there is a maximal structure that is possible, and if you do not know that schema in advance you can determine it by comparing files.
Once you have a fixed schema, we can read all files according to it, and the missing properties will simply be imported as `null` values.
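If you prefer to discover the property names programmatically rather than by eyeballing the files, `OPENJSON` without a `WITH` clause returns one row per key/value pair. Here is a minimal sketch of that discovery step, assuming the file content has already been loaded into a variable (the `@json` variable and its sample content are illustrative):

```sql
DECLARE @json nvarchar(max) = N'[
  { "productCode": "00003", "productType": "Toy",
    "sellerDetails": [ { "sellerName": "Cosco", "sellerCountry": "AUS", "sellerState": "VIC" } ] }
]';

-- One row per distinct property name found on any root-level object
SELECT DISTINCT prop.[key] AS propertyName
FROM OPENJSON (@json) AS element                -- one row per array element
CROSS APPLY OPENJSON (element.[value]) AS prop; -- one row per property of that element
```

Running this over each day's file and unioning the results gives you the maximal set of root-level properties; the nested `sellerDetails` objects can be inspected the same way.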
A complicating factor in this input is that `sellerDetails` is a nested array. In SQL that is a new dimension, effectively a child table, so we need to join the root element properties onto the result of extracting the nested array, which flattens the data. This is described in How to use OPENJSON on multiple rows.
Let us therefore define the maximal schema, assuming that `sellerDetails` has the three properties `sellerName`, `sellerCountry`, and `sellerState`. If you identify more fields, add them to the `WITH` declaration.
This fiddle demonstrates a solution in MS SQL Server: http://sqlfiddle.com/#!18/7d5fd/2
I have unioned the two inputs into the same result set so you can compare the output:
```sql
SELECT DISTINCT product.productCode, product.productType, seller.*
FROM
(
    SELECT @json1 AS json
    UNION
    SELECT @json2
) input
CROSS APPLY OPENJSON (json)
WITH (
    productCode   varchar(10),
    productType   varchar(50),
    sellerDetails nvarchar(max) AS JSON
) product
OUTER APPLY OPENJSON (product.sellerDetails)
WITH (
    sellerName    varchar(50),
    sellerCountry varchar(50),
    sellerState   varchar(50)
) seller
```
| productCode | productType | sellerName | sellerCountry | sellerState |
|-------------|-------------|------------|---------------|-------------|
| 00001       | Food        | Cosco      | UK            | NULL        |
| 00002       | Clothing    | NULL       | NULL          | NULL        |
| 00003       | Toy         | Cosco      | AUS           | VIC         |
Notice that the query tolerates the missing `sellerState` in the first file and injects `NULL` instead.
Thank you to @Charlieface for pointing out that we should use `OUTER APPLY OPENJSON (product.sellerDetails)` for nested `OPENJSON` calls. `CROSS APPLY` behaves like an `INNER JOIN`, so using it here would filter out `product` records that do not have any `sellerDetails` records, like 'Clothing' in this example.
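To see the difference, here is a minimal sketch of the nested extraction with `CROSS APPLY` instead of `OUTER APPLY` (it reuses the `@json1` variable declared below). Because product 00002 has no `sellerDetails` array, the inner `OPENJSON` produces no rows for it and the whole record disappears:

```sql
SELECT product.productCode, seller.sellerName
FROM OPENJSON (@json1)
WITH (
    productCode   varchar(10),
    sellerDetails nvarchar(max) AS JSON
) product
CROSS APPLY OPENJSON (product.sellerDetails)
WITH (
    sellerName varchar(50)
) seller
-- Returns only 00001; with OUTER APPLY, 00002 also appears with a NULL sellerName.
```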
`DISTINCT` is also added to the query to remove duplicate entries; product 00002 'Clothing', for instance, appears in both files.
These are the variable declarations:
```sql
DECLARE @json1 varchar(max) = '[
  {
    "productCode": "00001",
    "productType": "Food",
    "sellerDetails": [
      {
        "sellerName": "Cosco",
        "sellerCountry": "UK"
      }
    ]
  },
  {
    "productCode": "00002",
    "productType": "Clothing"
  }
]';

DECLARE @json2 varchar(max) = '[
  {
    "productCode": "00001",
    "productType": "Food",
    "sellerDetails": [
      {
        "sellerName": "Cosco",
        "sellerCountry": "UK"
      }
    ]
  },
  {
    "productCode": "00002",
    "productType": "Clothing"
  },
  {
    "productCode": "00003",
    "productType": "Toy",
    "sellerDetails": [
      {
        "sellerName": "Cosco",
        "sellerCountry": "AUS",
        "sellerState": "VIC"
      }
    ]
  }
]';
```
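In production the JSON would come from the daily file rather than from a variable. Here is a minimal sketch of that last step, assuming SQL Server can read the file from disk; the file path and the target table `dbo.ProductSellers` are hypothetical and should be replaced with your own:

```sql
-- Load the whole file into a variable (the path is illustrative)
DECLARE @json nvarchar(max) =
(
    SELECT BulkColumn
    FROM OPENROWSET (BULK 'C:\data\products.json', SINGLE_CLOB) AS src
);

-- Flatten and persist using the same fixed maximal schema
INSERT INTO dbo.ProductSellers (productCode, productType, sellerName, sellerCountry, sellerState)
SELECT product.productCode, product.productType,
       seller.sellerName, seller.sellerCountry, seller.sellerState
FROM OPENJSON (@json)
WITH (
    productCode   varchar(10),
    productType   varchar(50),
    sellerDetails nvarchar(max) AS JSON
) product
OUTER APPLY OPENJSON (product.sellerDetails)
WITH (
    sellerName    varchar(50),
    sellerCountry varchar(50),
    sellerState   varchar(50)
) seller;
```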