php xml cakephp sap mptt

Parsing large XML file with PHP with non-standard nesting of elements (SAP Roadmap File)


Background of issue:

I have a folder with lots of directories, files, attachments, and JavaScript. A main core file is processed by ActiveX to generate a 'JS Tree' type structure made up of nested table after nested table. In short, it is awful.

The problem I am presented with is getting it loaded into a database so that they can apply states to the associated content.

Parsing the XML file is not necessarily a problem for me; getting the structure to flow correctly is. The file is not nested in a logical manner that would lend itself to easily creating the structure in the database or filesystem. It consists of Structure nodes, each containing a bit of information about that node and any relevant content in the filesystem.

I was thinking of loading it into an MPTT-type structure, but working out the parent/child relationships between the various nodes is where I am stumbling. Below is a sample of this XML file:

   <Structure nodeid="9D565FD65DE9464EA36F005866DBF3AE" ParentID = "6EEB45ED97634C9BB2730D7713255673" IsAddOnNode="True" IsCoreNode = "0" >
      <Name>POS specific remarks</Name>
      <Sequence>1</Sequence>
      <WBS>1.1.3.1</WBS>
      <BackgroundColor>#80FF00</BackgroundColor>
      <FontColor>Black</FontColor>
      <Comments></Comments>
      <References></References>
   </Structure>
   <Structure nodeid="A6F7E2F0728147BB88429545A6C490CA" ParentID = "B17AB99B64664624AAA41E220A9EAE59" IsAddOnNode="False" IsCoreNode = "0" >
      <Name>Execution, Monitoring, and Controlling of Results</Name>
      <Sequence>4</Sequence>
      <WBS>1.1.4</WBS>
      <BackgroundColor></BackgroundColor>
      <FontColor>White</FontColor>
      <Comments>
         <Comment AddOnID = "53539AB26B50472CAA2DF4E428605C87" Version="0.2"></Comment>
      </Comments>
      <References></References>
   </Structure>
   <Structure nodeid="EFCCA56742074A2A859FD1C547850ABA" ParentID = "A6F7E2F0728147BB88429545A6C490CA" IsAddOnNode="False" IsCoreNode = "0" >
      <Name>Project Performance Reports</Name>
      <Sequence>1</Sequence>
      <WBS>1.1.4.1</WBS>
      <BackgroundColor></BackgroundColor>
      <FontColor>White</FontColor>
      <Comments></Comments>
      <References></References>
   </Structure>

When it has been parsed with ActiveX, the structure (on the left navigation pane) is arranged like a standard outline or ordered list:

1. Project Preparation

 1.1 Project Management

     1.1.1 Phase Start-Up

           1.1.1.1 Item 1

           1.1.1.2 Item 2

           1.1.1.3 Item 3

And so forth. To the best of my understanding, these values that denote the section or subsection (1.1.1.2) are stored in the WBS tag of the Structure node. I think what I need to do is to parse those out and create the structure according to that. How to do that is where I am stumped.
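A minimal sketch of reading those WBS values, written in Python rather than PHP purely for brevity. It assumes every WBS value is a dot-separated list of integers, as in the sample above; the helper names and the codes `1.1.2` and `1.1.10` are hypothetical and only there to show the ordering.

```python
def wbs_key(wbs):
    """Turn '1.1.4.1' into (1, 1, 4, 1) so codes sort numerically, not as text."""
    return tuple(int(part) for part in wbs.strip().split("."))


def wbs_parent(wbs):
    """Return the parent WBS code, or None for a top-level code such as '1'."""
    head, sep, _tail = wbs.strip().rpartition(".")
    return head if sep else None


def wbs_depth(wbs):
    """Number of levels in the code: '1' is depth 1, '1.1.4.1' is depth 4."""
    return len(wbs_key(wbs))


# '1.1.3.1', '1.1.4' and '1.1.4.1' come from the sample; the rest are made up.
codes = ["1.1.4.1", "1.1.10", "1.1.4", "1.1.3.1", "1.1.2"]
ordered = sorted(codes, key=wbs_key)
```

Sorting by the integer tuple rather than by the raw string keeps `1.1.10` after `1.1.4`, and a parent code always sorts before its own children.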

There is also a Sequence node that seems to store the position of the element among its parent's children.

What I would LIKE to do

What I would like to do is create a bunch of database entries (preferably in MPTT) so that I can easily generate a nav tree; after that I can begin worrying about 'scraping' all the individual files so that I can store their content in the database as well. Somehow, I need to parse the WBS node value to create its 'index' within the table.

I am hoping that the solution is simpler than I am anticipating. Suggestions, or a prod in the right direction, would be greatly appreciated.
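As a rough illustration of what MPTT storage involves, the sketch below (Python, for brevity) assigns nested-set left/right numbers from parent links, ordering siblings by their Sequence value. The short ids such as `prep` and `pm` are hypothetical stand-ins for the 32-character nodeid values, and the field names are assumptions rather than anything taken from the roadmap file.

```python
from collections import defaultdict


def assign_mptt(nodes):
    """nodes: dicts with 'id', 'parent_id' and 'sequence'.

    Returns {id: (lft, rght)} using nested-set numbering.
    """
    ids = {n["id"] for n in nodes}
    children = defaultdict(list)
    for n in nodes:
        # A node whose parent is not in the data is treated as a root.
        parent = n["parent_id"] if n["parent_id"] in ids else None
        children[parent].append(n)
    for siblings in children.values():
        siblings.sort(key=lambda n: n["sequence"])

    lft, rght = {}, {}
    counter = 1
    # Iterative depth-first walk: each node is visited on entry and on exit.
    stack = [(n["id"], False) for n in reversed(children[None])]
    while stack:
        node_id, leaving = stack.pop()
        if leaving:
            rght[node_id] = counter
        else:
            lft[node_id] = counter
            stack.append((node_id, True))
            stack.extend((c["id"], False) for c in reversed(children[node_id]))
        counter += 1
    return {i: (lft[i], rght[i]) for i in lft}


nodes = [
    {"id": "prep", "parent_id": None, "sequence": 1},      # 1
    {"id": "pm", "parent_id": "prep", "sequence": 1},      # 1.1
    {"id": "exec", "parent_id": "pm", "sequence": 4},      # 1.1.4
    {"id": "startup", "parent_id": "pm", "sequence": 1},   # 1.1.1
    {"id": "reports", "parent_id": "exec", "sequence": 1}, # 1.1.4.1
]
bounds = assign_mptt(nodes)
```

A node's descendants are then exactly the rows whose left value falls between its own left and right values. CakePHP's TreeBehavior uses `parent_id`, `lft` and `rght` columns by default and documents a `recover()` method for rebuilding the left/right values from `parent_id`, so computing them by hand may not be necessary there.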

I was planning on using the TreeBehavior within CakePHP to manage this, but I don't necessarily have to use that to process the file.


Solution

  • I may be mistaken but doesn't the:

    <Structure nodeid="EFCCA56742074A2A859FD1C547850ABA" ParentID = "A6F7E2F0728147BB88429545A6C490CA">
    

    give you the Structure's nodeid and the id of its parent? So you know that EFCCA56742074A2A859FD1C547850ABA is a child of A6F7E2F0728147BB88429545A6C490CA?

    Storing a tree data structure in an RDBMS is a long story, since an RDBMS has no built-in concept of hierarchy, but there are various models that allow you to accomplish such a task. You could check http://www.slideshare.net/quipo/trees-in-the-database-advanced-data-structures to get started.

    The Adjacency List is probably the easiest way to go about it, but if you use MySQL, which lacks recursive queries before version 8.0, you would have to do lots of joins to get down to the last node, or process the tree in your application layer.
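To make the adjacency-list idea concrete, here is a hedged sketch in Python (chosen for brevity; it is not the asker's PHP) that streams Structure elements, records each node's ParentID, and walks the resulting tree in the application layer. The `<Roadmap>` wrapper element and the function names are assumptions for illustration; the two Structure nodes are trimmed copies of the ones in the question.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# <Roadmap> is a hypothetical root element; the real file's root is not shown.
SAMPLE = b"""<Roadmap>
  <Structure nodeid="A6F7E2F0728147BB88429545A6C490CA" ParentID = "B17AB99B64664624AAA41E220A9EAE59">
    <Name>Execution, Monitoring, and Controlling of Results</Name>
    <Sequence>4</Sequence>
    <WBS>1.1.4</WBS>
  </Structure>
  <Structure nodeid="EFCCA56742074A2A859FD1C547850ABA" ParentID = "A6F7E2F0728147BB88429545A6C490CA">
    <Name>Project Performance Reports</Name>
    <Sequence>1</Sequence>
    <WBS>1.1.4.1</WBS>
  </Structure>
</Roadmap>"""


def read_structures(source):
    """Stream <Structure> elements and return {nodeid: row}."""
    rows = {}
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "Structure":
            continue
        rows[elem.get("nodeid")] = {
            "parent_id": elem.get("ParentID"),
            "name": elem.findtext("Name") or "",
            "sequence": int(elem.findtext("Sequence") or 0),
            "wbs": elem.findtext("WBS") or "",
        }
        elem.clear()  # drop the children once read, to limit memory use
    return rows


def outline(rows):
    """Walk the adjacency list depth-first and return (depth, name) pairs."""
    children = defaultdict(list)
    for node_id, row in rows.items():
        # A parent that is missing from the file (as in this excerpt) marks a root.
        parent = row["parent_id"] if row["parent_id"] in rows else None
        children[parent].append(node_id)
    for ids in children.values():
        ids.sort(key=lambda i: rows[i]["sequence"])

    result = []
    stack = [(i, 0) for i in reversed(children[None])]
    while stack:
        node_id, depth = stack.pop()
        result.append((depth, rows[node_id]["name"]))
        stack.extend((c, depth + 1) for c in reversed(children[node_id]))
    return result


rows = read_structures(io.BytesIO(SAMPLE))
for depth, name in outline(rows):
    print("  " * depth + name)
```

Each row carries only its own id and its parent's id, which is all an adjacency-list table needs; the whole table can be fetched in one query and assembled in memory like this instead of joining once per level. The sketch assumes the parent links contain no cycles.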