c# xml xmlwriter xmldom

Merging multiple occurrences of a node under one parent node using C#


I have an XML document containing multiple <Page PageID="1"> nodes. Each such node has <Para ParaID="1"> nodes under it. I want a single occurrence of each <Page> node, so that all <Para> nodes belonging to the same page appear as children of that one <Page> node. For example:

INPUT:

<Page PageID="**1**">
   <Para ParaID="1">
     <some nodes as child of para>
   </Para>
</Page>
<Page PageID="**2**">
   <Para ParaID="**1**">
     <some nodes as child of para>
   </Para>
</Page>
<Page PageID="**1**"> <!-- Page 1 encountered again -->
   <Para ParaID="**1**">
     <some nodes as child of para>
   </Para>
</Page>
<Page PageID="**3**">
   <Para ParaID="**1**">
     <some nodes as child of para>
   </Para>
</Page>

Expected OUTPUT:

<Page PageID="**1**">
   <Para ParaID="**1**">
     <some nodes as child of para>
   </Para>
   <Para ParaID="**2**">           <!-- all <Para> of Page 1 are under a single <Page> node -->
     <some nodes as child of para>
   </Para>
</Page>
<Page PageID="**2**">
   <Para ParaID="**1**">
     <some nodes as child of para>
   </Para>
</Page>
<Page PageID="**3**">
   <Para ParaID="**1**">
     <some nodes as child of para>
   </Para>
</Page>

Solution

  • If you are using .NET 3.5 or later, you can use the XDocument family (LINQ to XML) and the LINQ extension methods to make fairly light work of the task:

    var doc1 = XDocument.Parse(stringContainingYourXML);
    var groups = doc1.Root.Elements().ToLookup(elt => elt.Attribute("PageID").Value);
    var unique = groups.AsEnumerable().Select(group => group.First());
    var doc2 = new XDocument(new XElement("root", unique));
    

    Line 2 builds a lookup table in which elements with the same PageID value are grouped together. Given your example XML, it takes the 4 <Page/> elements and creates 3 groups, one of which contains both PageID="1" elements.

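    On line 3, we loop through the 3 groups and extract just the first XML element from each one, and on line 4 we put those 3 elements into a new document. Note that this keeps only the first <Page/> of each group, so the paragraphs of any later duplicate are dropped. The resulting XML is: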

    <root>
      <Page PageID="**1**">
        <Para ParaID="1" />
      </Page>
      <Page PageID="**2**">
        <Para ParaID="**1**" />
      </Page>
      <Page PageID="**3**">
        <Para ParaID="**1**" />
      </Page>
    </root>
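
    If the paragraphs of the duplicate pages should be kept rather than dropped, the same grouping idea can be projected into new <Page/> elements. The following is only a sketch: PageMerger and MergeByPageId are hypothetical names that are not part of the original answer, and it assumes every <Page/> carries a PageID attribute. It leaves the ParaID values untouched, so a merged page may contain repeated ParaID values.

    ```csharp
    using System;
    using System.Linq;
    using System.Xml.Linq;

    static class PageMerger
    {
        // Hypothetical helper (not from the original answer): returns a new root
        // element with one <Page> per distinct PageID, holding copies of every
        // <Para> from all pages sharing that ID. ParaID values are not changed.
        // Assumption: every <Page> has a PageID attribute.
        public static XElement MergeByPageId(XElement root)
        {
            return new XElement(root.Name,
                root.Elements("Page")
                    // GroupBy yields groups in order of each key's first appearance.
                    .GroupBy(page => page.Attribute("PageID").Value)
                    .Select(group => new XElement("Page",
                        new XAttribute("PageID", group.Key),
                        // Elements that already have a parent are copied when added.
                        group.Elements("Para"))));
        }
    }
    ```

    Calling PageMerger.MergeByPageId(doc1.Root) on the example input should give three <Page/> elements, the first of which holds both paragraphs of page 1.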
    

    Update: 2011/03/12

    The code below takes into account the requirement that paragraphs from duplicate instances of a page be merged into the first instance, with their ParaID values renumbered to continue from the last paragraph already on that page.

    The revised solution is pretty awful compared to the previous one, but messing around with the ParaID values (especially in the format they are in) was quite annoying. I'm not proud of this, but here it is:

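    using System;
    using System.Linq;
    using System.Text.RegularExpressions;
    using System.Xml.Linq;

    namespace SO {
        class Program {
            // Sample input from the question, wrapped in a <root> element
            // (an assumption: the question does not show the enclosing element).
            const string xmlstr = @"<root>
      <Page PageID='**1**'><Para ParaID='1' /></Page>
      <Page PageID='**2**'><Para ParaID='**1**' /></Page>
      <Page PageID='**1**'><Para ParaID='**1**' /></Page>
      <Page PageID='**3**'><Para ParaID='**1**' /></Page>
    </root>";

            static void Main(string[] args) {
                var doc1 = XDocument.Parse(xmlstr);
                // Group the pages by PageID; groups keep the order of first appearance.
                var groups = doc1.Root.Elements().ToLookup(page => page.Attribute("PageID").Value);
                var doc2 = new XDocument(new XElement("root"));

                foreach (var group in groups) {
                    var firstpage = group.First();
                    // ParaID values may carry non-digit decoration (e.g. "**1**"),
                    // so only the run of digits is parsed. This throws if the first
                    // page has no <Para> or its last ParaID contains no digits.
                    var startindex = firstpage.Elements("Para").Last().Attribute("ParaID").Value;
                    var lastindex = int.Parse(Regex.Match(startindex, @"\d+").Value);

                    // Duplicate pages: renumber their paragraphs to continue after
                    // lastindex and append them to the first page. Add() enumerates
                    // the query once and copies elements that already have a parent.
                    firstpage.Add(
                        group.Skip(1)
                             .SelectMany(page => page.Elements("Para"))
                             .Select(
                                 para => {
                                     para.Attribute("ParaID").Value = Regex.Replace(
                                         para.Attribute("ParaID").Value,
                                         @"\d+",
                                         m => (++lastindex).ToString()
                                     );
                                     return para;
                                 }
                             )
                    );

                    doc2.Root.Add(firstpage);
                }

                Console.WriteLine(doc2);
                Console.ReadKey(true);
            }
        }
    }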
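
    The renumbering step is the least obvious part of the code above, so here it is in isolation as a small sketch. ParaIdDemo and Renumber are hypothetical names introduced only for illustration. One difference is worth noting: the code above replaces every run of digits in a ParaID and bumps the counter once per run, whereas this helper replaces only the first run, which seemed the safer reading for IDs such as "**1**".

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    static class ParaIdDemo
    {
        private static readonly Regex Digits = new Regex(@"\d+");

        // Hypothetical helper: advances the counter and writes the new value over
        // the first run of digits in paraId, keeping all other characters as-is.
        // An ID without digits is returned unchanged (the counter still advances).
        public static string Renumber(string paraId, ref int lastIndex)
        {
            int next = ++lastIndex;
            return Digits.Replace(paraId, next.ToString(), 1);
        }
    }
    ```

    With a counter starting at 1, Renumber("**1**", ref counter) gives "**2**" and leaves the counter at 2, so the next paragraph would receive 3.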