Search code examples
sql-serverxmlssisdatabase-migration

Data Migration From Multilevel XML to Single Table With SSIS


Goal

I am attempting to migrate data from a multi-layered XML file with nested elements to a single table.

System Parameters

  • MS SQL Server Management Studio
  • Microsoft Visual Studio SSIS

The XML File

Here's the XSD for the XML file I have

As you can see, it is not just a simple layout. The whole thing is wrapped in a 'People' tag, and there about ~1000 'Person's. Each 'Person' tag contains the following elements of information. The XML goes like this:

Person

  • FirstName
  • LastName
  • Biography
  • Expertise
  • Image
  • Link
  • Books
    • Book
      • Year
      • Details
    • Book
      • Year
      • Details
    • ... (There can be many of these)
  • Articles
    • Article
      • Year
      • Details
    • Article
      • Year
      • Details
    • ... (There can be many of these)
  • Papers
  • Artwork
  • Websites

As a side note, there can be multiple

Question

Now, here's my question. How do I put all of this information into a single SQL table with SSIS? I know that the topology of the XML file doesn't directly map onto the topology of a table, but I want to force it. I want a separate row for each 'Person'. I also want enough columns to capture the maximum number of 'Book's that any one person in my data set has. Maybe this means creating 'Book_1', 'Book_2', 'Book_3'.. etc. columns in the final table. I do not want a series of tables with foreign and primary keys. I want a separate column for each 'Book', 'Year' 'Details' corresponding to each element. To be more clear, let me show you want I mean with an example.

Example XML File

If I have a 'Books' tag with 3 'Book' elements, I want to create a separate column for each book:

  • John
  • Steinbeck
  • ...
  • Books
    • East of Eden
      • 1952
      • A great book
    • Of Mice And Men
      • 1937
      • Book about mice
    • The Grapes of Wrath
      • 1939
      • Book about angry grapes
  • Articles
    • Article
  • ...

Example Resulting Table In SQL Database

I want the table to look like this, and look like this for all of the nested elements of the XML file. Is it possible to do a kind of flattened import to a database in this manner using SSIS?

Thanks! I really appreciate it.

Additional Notes

  • Some of the entries in the XML file contain up to 60,000 characters. What data type does that mean I should use?

Snippet of Actual XML File

Here's an example of the XML file. The actual XML has many <Person>'s.

<?xml version="1.0" encoding="UTF-8" ?>
<People>
    <Person>
        <FirstName>Eliza</FirstName>
        <LastName>Ablovatski</LastName>
        <Biography>
            <![CDATA[<p>Eliza Ablovatski joined the Kenyon history department in 2003, after graduate work in East Central European history at Columbia University and research and fellowships in Munich and Berlin, Germany and Budapest, Hungary. She teaches classes on Europe from 1500 to the present, focusing on the nineteenth and twentieth centuries, Germany, Russia, the Habsburg Monarchy, film, nationalism and identity, gender, race, and the interwar period.</p>
<p>Her dissertation and first book,&nbsp;<em>Revolution and Political Violence in Central Europe: The Deluge of 1919</em> (forthcoming from Cambridge University Press), focus on the revolutionary upheavals in Munich and Budapest following the First World War, and their relationship to political violence and antisemitism. She is currently researching the occupation of Austria (1945-1955) at the end of the Second World War, and the nuclear idea in postwar Europe. She has also researched and written extensively on the history of Jews in the former Habsburg regional capital of Czernowitz (now Ukraine).</p>]]>
        </Biography>
        <Expertise>
            <![CDATA[<p>Modern Europe, especially Germany and Central/East Central Europe in the nineteenth and twentieth centuries; European Jewish and women's history, East European and German film and literature, socialism, war, and revolution.</p>]]>
        </Expertise>
        <Image>http://www.kenyon.edu/images/directory/ablovatski.jpg</Image>
        <Link>http://www.kenyon.edu/directories/campus-directory/biography/eliza-ablovatski/</Link>
        <Books>
            <Book>
                <Year></Year>
                <Details>
                    <![CDATA[<p><em>Zwischen Pruth und Jordan. Lebenserinnerungen Czernowitzer Juden</em><em>&nbsp;,&nbsp;</em>with Gaby Coldewey and others K&ouml;ln: B&ouml;hlau Verlag, 2003</p>]]>
                </Details>
            </Book>
            <Book>
                <Year></Year>
                <Details>
                    <![CDATA[<p><em>Czernowitz ist gewen an alt jiddische Stdt: &Uuml;berlebende berichten,</em>&nbsp;With Gaby Coldewey and others. First Edition: Czernowitz,Ukraine: distributed by the Heinrich-B&ouml;ll-Stiftung, 1998 Second Edition: Berlin, 1999 (Third edition: Potsdam, forthcoming 2009)</p>]]>
                </Details>
            </Book>
        </Books>
        <Articles>
            <Article>
                <Year></Year>
                <Details>
                    <![CDATA[<p>"The Central European Revolutions of 1919 and the Myth of Judeo-Bolshevism,"&nbsp;<em>European Review of History, Vol. 17/ Issue 3: Cosmopolitanism, Nationalism and the Jews of East Central Europe (2010), 473-489.</em></p>]]>
                </Details>
            </Article>
            <Article>
                <Year></Year>
                <Details>
                    <![CDATA[<p>"Between Red Army and White Guard: Women in Budapest, 1918-1919," in&nbsp;<em>Gender and War in Twentieth-Century Eastern Europe,</em>&nbsp;edited by Maria Bucur and Nancy Wingfield&nbsp;Bloomington: Indiana University Press 2006</p>]]>
                </Details>
            </Article>
            <Article>
                <Year></Year>
                <Details>
                    <![CDATA[<p>"The Girl with the Titus-head: Women in Revolution in Munich and Budapest, 1919"&nbsp;<em>Nationalities Papers&nbsp;</em>28/3 (September 2000), 541-550</p>]]>
                </Details>
            </Article>
        </Articles>
        <Papers>
        </Papers>
        <Artwork>
        </Artwork>
        <Websites>
        </Websites>
    </Person>
...This goes on to include many <Person> elements. (About 1000)
</People>

Solution

  • Thx for the actual XML! The following query will get your values out of the XML. It will generate IDs for them to store all data in related tables.

    Attention: I had to double the ' sign in woman's and I added a second person to show the approach:

    DECLARE @x XML=
    '<?xml version="1.0" encoding="UTF-8" ?>
    <People>
        <Person>
            <FirstName>Eliza</FirstName>
            <LastName>Ablovatski</LastName>
            <Biography>
                <![CDATA[<p>Eliza Ablovatski joined the Kenyon history department in 2003, after graduate work in East Central European history at Columbia University and research and fellowships in Munich and Berlin, Germany and Budapest, Hungary. She teaches classes on Europe from 1500 to the present, focusing on the nineteenth and twentieth centuries, Germany, Russia, the Habsburg Monarchy, film, nationalism and identity, gender, race, and the interwar period.</p>
    <p>Her dissertation and first book,&nbsp;<em>Revolution and Political Violence in Central Europe: The Deluge of 1919</em> (forthcoming from Cambridge University Press), focus on the revolutionary upheavals in Munich and Budapest following the First World War, and their relationship to political violence and antisemitism. She is currently researching the occupation of Austria (1945-1955) at the end of the Second World War, and the nuclear idea in postwar Europe. She has also researched and written extensively on the history of Jews in the former Habsburg regional capital of Czernowitz (now Ukraine).</p>]]>
            </Biography>
            <Expertise>
                <![CDATA[<p>Modern Europe, especially Germany and Central/East Central Europe in the nineteenth and twentieth centuries; European Jewish and women''s history, East European and German film and literature, socialism, war, and revolution.</p>]]>
            </Expertise>
            <Image>http://www.kenyon.edu/images/directory/ablovatski.jpg</Image>
            <Link>http://www.kenyon.edu/directories/campus-directory/biography/eliza-ablovatski/</Link>
            <Books>
                <Book>
                    <Year></Year>
                    <Details>
                        <![CDATA[<p><em>Zwischen Pruth und Jordan. Lebenserinnerungen Czernowitzer Juden</em><em>&nbsp;,&nbsp;</em>with Gaby Coldewey and others K&ouml;ln: B&ouml;hlau Verlag, 2003</p>]]>
                    </Details>
                </Book>
                <Book>
                    <Year></Year>
                    <Details>
                        <![CDATA[<p><em>Czernowitz ist gewen an alt jiddische Stdt: &Uuml;berlebende berichten,</em>&nbsp;With Gaby Coldewey and others. First Edition: Czernowitz,Ukraine: distributed by the Heinrich-B&ouml;ll-Stiftung, 1998 Second Edition: Berlin, 1999 (Third edition: Potsdam, forthcoming 2009)</p>]]>
                    </Details>
                </Book>
            </Books>
            <Articles>
                <Article>
                    <Year></Year>
                    <Details>
                        <![CDATA[<p>"The Central European Revolutions of 1919 and the Myth of Judeo-Bolshevism,"&nbsp;<em>European Review of History, Vol. 17/ Issue 3: Cosmopolitanism, Nationalism and the Jews of East Central Europe (2010), 473-489.</em></p>]]>
                    </Details>
                </Article>
                <Article>
                    <Year></Year>
                    <Details>
                        <![CDATA[<p>"Between Red Army and White Guard: Women in Budapest, 1918-1919," in&nbsp;<em>Gender and War in Twentieth-Century Eastern Europe,</em>&nbsp;edited by Maria Bucur and Nancy Wingfield&nbsp;Bloomington: Indiana University Press 2006</p>]]>
                    </Details>
                </Article>
                <Article>
                    <Year></Year>
                    <Details>
                        <![CDATA[<p>"The Girl with the Titus-head: Women in Revolution in Munich and Budapest, 1919"&nbsp;<em>Nationalities Papers&nbsp;</em>28/3 (September 2000), 541-550</p>]]>
                    </Details>
                </Article>
            </Articles>
            <Papers>
            </Papers>
            <Artwork>
            </Artwork>
            <Websites>
            </Websites>
        </Person>
        <Person>
            <FirstName>One</FirstName>
            <LastName>More</LastName>
            <Biography>Biography: Some interesting facts...</Biography>
            <Expertise>Expertise: Some interesting facts...</Expertise>
            <Image>somepicture.jpg</Image>
            <Link>somelink.com</Link>
            <Books>
                <Book>
                    <Year>2001</Year>
                    <Details>Book1</Details>
                </Book>
                <Book>
                    <Year>2002</Year>
                    <Details>Book2</Details>
                </Book>
            </Books>
            <Articles>
                <Article>
                    <Year>2001</Year>
                    <Details>Article1</Details>
                </Article>
            </Articles>
            <Papers>
            </Papers>
            <Artwork>
            </Artwork>
            <Websites>
            </Websites>
        </Person>
    </People>';
    
    With MyPersonCTE AS
    (
        SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS PersonID
              ,p.value('FirstName[1]','varchar(max)') AS FirstName
              ,p.value('LastName[1]','varchar(max)') AS LastName
              ,p.value('Biography[1]','varchar(max)') AS Biography
              ,p.value('Expertise[1]','varchar(max)') AS Expertise
              ,p.value('Image[1]','varchar(max)') AS Image
              ,p.value('Link[1]','varchar(max)') AS Link
              ,p.query('Books') AS BookNode
              ,p.query('Articles') AS ArticleNode
              --same for Papers, Artwork...
        FROM @x.nodes('/People/Person') AS A(p) 
    )
    ,MyBooksCTE AS
    (
        SELECT MyPersonCTE.*
              ,ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS BookID
              ,x.value('Year[1]','int') AS BookYear
              ,x.value('Details[1]','varchar(max)') AS BookDetails
        FROM MyPersonCTE
        CROSS APPLY MyPersonCTE.BookNode.nodes('/Books/Book') A(x)  
    )
    ,MyArticlesCTE AS
    (
        SELECT MyPersonCTE.*
              ,ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS ArticleID
              ,x.value('Year[1]','int') AS ArticleYear
              ,x.value('Details[1]','varchar(max)') AS ArticleDetails
        FROM MyPersonCTE
        CROSS APPLY MyPersonCTE.ArticleNode.nodes('/Articles/Article') A(x)  
    )
    --same for Papers, Artwork...
    SELECT p.*
          ,b.BookID
          ,b.BookYear
          ,b.BookDetails
          ,a.ArticleID
          ,a.ArticleYear
          ,a.ArticleDetails  
    INTO #tempAllData
    FROM MyPersonCTE AS p
    LEFT JOIN MyBooksCTE AS b ON p.PersonID=b.PersonID
    LEFT JOIN MyArticlesCTE AS a ON p.PersonID=a.PersonID ;
    
    --#tempAllData is now filled with all data, copied in all combination: much to much
    --but DISTINCT is your friend
    --in this case you'd use the PersonID as FK in all related tables
    
    SELECT DISTINCT PersonID,FirstName,LastName,Biography,Expertise --other fields
    FROM #tempAllData;
    
    SELECT DISTINCT PersonID,BookID,BookYear,BookDetails
    FROM #tempAllData;
    
    SELECT DISTINCT PersonID,ArticleID,ArticleYear,ArticleDetails
    FROM #tempAllData;
    
    DROP TABLE #tempAllData;
    

    The results

    Persons:

    1   Eliza    Ablovatski     <p>Eliza Ablovatski joined ...
    2   One      More           Biography: Some interesting facts...    
    

    Books

    1   1   0       <p><em>Zwischen Pruth und ...
    1   2   0       <p><em>Czernowitz ist gewen ...
    2   3   2001    Book1
    2   4   2002    Book2
    

    Articles

    1   1   0       <p>"The Central European ...
    1   2   0       <p>"Between Red Army and White ...
    1   3   0       <p>"The Girl with the Titus-head: ...
    2   4   2001    Article1
    

    But what you really want to achieve is the one, big table

    This is only possible with dynamic SQL. Start as above and change the query to the following. It will first find the columnnames automatically and then use a UNION ALL to force all data into the same structure and finally there's a big, dynamic PIVOT:

    Attention: I added a PARTITION BY PersonID to the CTEs related ROW_NUMBERs. This is to get IDs starting with 1for each person

    With MyPersonCTE AS
    (
        SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS PersonID
              ,p.value('FirstName[1]','varchar(max)') AS FirstName
              ,p.value('LastName[1]','varchar(max)') AS LastName
              ,p.value('Biography[1]','varchar(max)') AS Biography
              ,p.value('Expertise[1]','varchar(max)') AS Expertise
              ,p.value('Image[1]','varchar(max)') AS Image
              ,p.value('Link[1]','varchar(max)') AS Link
              ,p.query('Books') AS BookNode
              ,p.query('Articles') AS ArticleNode
              --same for Papers, Artwork...
        FROM @x.nodes('/People/Person') AS A(p) 
    )
    ,MyBooksCTE AS
    (
        SELECT MyPersonCTE.*
              ,ROW_NUMBER() OVER(PARTITION BY PersonID ORDER BY (SELECT NULL)) AS BookID
              ,x.value('Year[1]','int') AS BookYear
              ,x.value('Details[1]','varchar(max)') AS BookDetails
        FROM MyPersonCTE
        CROSS APPLY MyPersonCTE.BookNode.nodes('/Books/Book') A(x)  
    )
    ,MyArticlesCTE AS
    (
        SELECT MyPersonCTE.*
              ,ROW_NUMBER() OVER(PARTITION BY PersonID ORDER BY (SELECT NULL)) AS ArticleID
              ,x.value('Year[1]','int') AS ArticleYear
              ,x.value('Details[1]','varchar(max)') AS ArticleDetails
        FROM MyPersonCTE
        CROSS APPLY MyPersonCTE.ArticleNode.nodes('/Articles/Article') A(x)  
    )
    --same for Papers, Artwork...
    SELECT p.*
          ,b.BookID
          ,b.BookYear
          ,b.BookDetails
          ,a.ArticleID
          ,a.ArticleYear
          ,a.ArticleDetails  
    INTO #tempAllData
    FROM MyPersonCTE AS p
    LEFT JOIN MyBooksCTE AS b ON p.PersonID=b.PersonID
    LEFT JOIN MyArticlesCTE AS a ON p.PersonID=a.PersonID ;
    
    --#tempAllData is now filled with all data, copied in all combination: much to much
    --but DISTINCT is your friend
    --in this case you'd use the PersonID as FK in all related tables
    
    SELECT DISTINCT PersonID,FirstName,LastName,Biography,Expertise --other fields
    INTO #tempPerson
    FROM #tempAllData;
    
    SELECT DISTINCT PersonID,BookID,BookYear,BookDetails
    INTO #tempBooks
    FROM #tempAllData;
    
    SELECT DISTINCT PersonID,ArticleID,ArticleYear,ArticleDetails
    INTO #tempArticles
    FROM #tempAllData;
    
    DECLARE @columnNames VARCHAR(MAX)=
     STUFF((SELECT DISTINCT ',Book_'+CAST(BookID AS VARCHAR(10)) FROM #tempBooks FOR XML PATH('')),1,1,'')
    +(SELECT DISTINCT ',Article_'+CAST(ArticleID AS VARCHAR(10)) FROM #tempArticles FOR XML PATH(''));
    
    DECLARE @cmd VARCHAR(MAX)=
    'SELECT p.*
    FROM
    (
        SELECT p.*
              ,''Book_''+CAST(BookID AS VARCHAR(10)) AS ColumnName
              ,ISNULL(CAST(BookYear AS VARCHAR(4)),'''') + '' '' + BookDetails AS Data
        FROM #tempPerson AS p
        INNER JOIN #tempBooks AS b ON p.PersonID=b.PersonID
        UNION ALL
        SELECT p.*
              ,''Article_''+CAST(ArticleID AS VARCHAR(10)) AS ColumnName
              ,ISNULL(CAST(ArticleYear AS VARCHAR(4)),'''') + '' '' + ArticleDetails AS Data
        FROM #tempPerson AS p
        INNER JOIN #tempArticles AS a ON p.PersonID=a.PersonID
    ) AS tbl
    PIVOT
    (
        MAX(Data) FOR ColumnName IN(' +  @columnNames + ')
    ) AS p;'
    
    
    EXEC(@cmd);
    
    DROP TABLE #tempArticles
    DROP TABLE #tempBooks 
    DROP TABLE #tempPerson
    DROP TABLE #tempAllData;