
Retrieving XML child nodes when there's mixed content


I've read several questions here that seemed to be related (either directly or indirectly) to the issue I'm having, but none so far have been satisfactory for my specific need, so I thought I would explain my situation, and see if we can come up with an answer together.

I've got a database of XML categories (AIML, specifically) that I would like to parse with SimpleXML functions to produce suitable output. That output is built from the <template> tag within the selected category. A simple example category looks like this:

<category>  
  <pattern>HOW ARE YOU</pattern>  
  <template>I am fine, how are you?</template> 
</category>

The <template> tag can hold either plain text, as shown above, or one or more of any number of different AIML tags, either alone or interspersed with text. The possibilities are virtually endless. Here is a more complex example:

<category>
  <pattern>NESTED RANDOM TEST</pattern>
  <template>
    <random>
      <li>
        <random>
          <li>Choice #1-1</li>
          <li>Choice #1-2</li>
          <li>Choice #1-3</li>
        </random>
      </li>
      <li>
        This is some example text, along with another RANDOM tag:
        <random>
          <li>Choice #2-1</li>
          <li>Choice #2-2</li>
          <li>Choice #2-3</li>
        </random>
      </li>
      <li>
        <random>
          <li>Choice #3-1</li>
          <li>Choice #3-2</li>
          <li>Choice #3-3</li>
        </random>
        This is some text that appears [i]after[/i] a RANDOM tag.
      </li>
    </random>
  </template>
</category>

If the template tag contains only text, or only other AIML tags, I have no problem parsing its contents. But if it has a combination of text and tags, as in the second and third outer <li> sections of the above example, I lose either the tags (if the text comes first) or the text (if a tag comes before it). This happens no matter how "deep" or "shallow" the text occurs within the tags. Thus, I have a bit of a problem here.
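
To make the symptom concrete, here is a minimal sketch of what SimpleXML exposes for a mixed-content element. The sample <li> string is made up for illustration and is much smaller than the real categories; the exact loss seen in practice depends on how the parsing code reads the element.

```php
<?php
// Hypothetical sample: one <li> with text before and after a nested tag.
$li = new SimpleXMLElement(
    '<li>Some text first <random><li>A</li><li>B</li></random> and some text after</li>'
);

// Casting to string yields only the element's own text;
// the <random> child is not part of it.
$textOnly = (string) $li;

// children() yields only child elements; the surrounding text is not part of it.
$childNames = [];
foreach ($li->children() as $child) {
    $childNames[] = $child->getName();
}

var_dump($textOnly);
var_dump($childNames);
```

Neither view says where the text sits relative to the <random> tag, which is the information that gets lost.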

As I've already mentioned, I've read several questions of this nature, and so far I've not found a satisfactory answer. However, I suspect that this could be because I don't fully understand some of the concepts involved, and so may not be implementing some solutions properly. For example, this post mentions "pre-processing" the XML using XSLT, and that seems like it would take care of my problem, but I have absolutely no clue how to implement that. Plus, I'm not using xStream, so I don't even know if this is something that I can implement. I'm afraid that I was never formally trained in PHP, and so my experience is a bit spotty. :)

I hope I've provided enough info to be clear about my situation without being too "wordy".


Solution

  • While this may not be the best way to solve my problem, I've found a rather simple and (to me, at least) somewhat elegant way to handle it: use preg_replace() to wrap every run of plain text in the XML string in <text> tags. Here's what I came up with:

    //First, some simple mixed-content XML:
    $myTemplate = '<template>Hello, <get name="name" />. I\'m glad to meet you.</template>';
    $myTemplate = preg_replace('~>(.*?)<~', '><text>$1</text><', $myTemplate);
    /*
    This can add unnecessary, empty <text> tags under certain circumstances, so the next line
    removes empty tag sets
    */
    $myTemplate = str_replace('<text></text>', '', $myTemplate);
    /*
    This makes the template look like this:
    
    <template><text>Hello, </text><get name="name" /><text>. I'm glad to meet you.</text></template>
    
    Now, to load my template as XML.
    */
    $xml = new SimpleXMLElement($myTemplate);
    

    From there, I can parse the XML as desired. As I said, this may not be the best way to go about it, but it's effective, and only adds a few lines of code. I'd still love to hear about other methods of handling this, but for now, this will do. I hope this helps someone else. :)
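
One other method, offered only as a sketch and not as the approach used above: the DOM extension exposes text nodes and element nodes as siblings, so mixed content can be walked in document order without rewriting the XML string. `collectParts()` and the array shape it returns are hypothetical names chosen for this example; `dom_import_simplexml()` is the standard PHP function that converts a SimpleXMLElement to a DOMElement.

```php
<?php
// Hypothetical helper: list an element's text and tags in document order.
function collectParts(DOMNode $node): array
{
    $parts = [];
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            // Text node (CDATA included); skip whitespace-only indentation.
            if (trim($child->nodeValue) !== '') {
                $parts[] = ['type' => 'text', 'value' => $child->nodeValue];
            }
        } elseif ($child instanceof DOMElement) {
            // Element node: record its name and recurse into its content.
            $parts[] = [
                'type'     => 'tag',
                'name'     => $child->nodeName,
                'children' => collectParts($child),
            ];
        }
    }
    return $parts;
}

$template = new SimpleXMLElement(
    '<template>Hello, <get name="name" />. I\'m glad to meet you.</template>'
);

// View the same node through the DOM API.
$parts = collectParts(dom_import_simplexml($template));
print_r($parts);
```

For this template the walk yields three entries in order: the text "Hello, ", the <get> tag, and the text ". I'm glad to meet you.". Because it does not rely on a regular expression, it also copes with text that spans several lines, which the `.` in the pattern above does not match by default.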