Represent an XML Schema and its Data in a SQL Server Graph Schema

I have a problem where I need to represent an XML Schema and its data inside of a SQL Server Database. I need to be able to access the data in a way that will allow me to create either an XML or JSON file.

I have looked at couple of solutions to this problem. First creating a traditional relational database and storing the XML data in a table representing the hierarchical structure by use of a self referencing parentId. This structure seems OK but as the size of XML is large the accessing of data is slow, as I must use a lot of recursions to obtain the hierarchical data I need. The performance of this design will tend to worsen as the amount of data increases.

Secondly, I looked at the use of creating a Graph Schema inside of SQL Server. Assigning each XML element as a node table and the xml element attributes as columns in that table. I then created a simple ‘isParentOf’ edge table, inserting this relationship between the different xml elements into the table. However, as each element is a separate node it is making queries cumbersome.

I know there isn’t a direct correlation between the XML Schema structure and Databases and have read articles on the complexity of such problems. But I wanted to reach out to the community to see if it is possible to achieve my goal using the SQL Graph Databases as this seems to be the best fit, in that I can define my elements and then create the different relationships.

I have provided some sample XML data below which contains the different permutations of the XML that I am currently working with in terms of element hierarchies, attributes, and data.

<?xml version="1.0" encoding="utf-8"?>
<Document xmlns='http://mydocument.com/schema/1'>
  <BankStatement frequency='monthly'>
    <Customer>
      <AcctNo>012-3456789</AcctNo>
      <Name type="full">John Doe</Name>
      <Street>123 Street Road</Street>
      <City>London</City>
    </Customer>
    <BeginDate>18/10/2022</BeginDate>
    <EndDate>18/11/2022</EndDate>
  </BankStatement>
</Document>

Solution

First creating a traditional relational database and storing the XML data in a table representing the hierarchical structure by use of a self referencing parentId...The performance of this design will tend to worsen as the amount of data increases.

No, it likely won't. When properly architected and indexed, the search time of your tables will be O(log(n)) because indexes use B-Tree data structures.

Let's say your table had 1 million rows in it. In the worst case scenario, log2(1 million) = 30. That's only 30 nodes of the B-Tree that would need to be seeked through to find your data. If your table grew to 1 billion rows, log2(1 billion) = 40. These are extremely small numbers for a computer to search through. (It's actually usually less than this because of something called the fan-out factor.)

Typically a recursive CTE can easily be used to efficiently crawl a hierarchical structure too. Hundreds of thousands of hierarchical rows can be crawled and related appropriately in under a second.

Aside from all of that, I don't even see a hierarchical relationship in your example data. Rather it appears to be a traditional data model problem that a relational structure would be well suited for.

Tables that I would recommend defining would be Customers, Accounts, maybe Addresses if a Customer can have more than one Address, and BankStatements. I'm sure your data probably has other relevant entities too.