
How to use simplexml_load_string in PHP to get innertext without embedded tags?


I found a freely available data dump of USPTO patent data in XML format. Part of the XML for most of the patents has the following structure:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v45-2014-04-03.dtd" [ ]>
<us-patent-grant lang="EN" dtd-version="v4.5 2014-04-03" file="US09226443-20160105.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20151221" date-publ="20160105">
  ...
  <claims>
    ...
    <claim id="CLM-00015" num="00015">
      <claim-text>15. The system of <claim-ref idref="CLM-00013">claim 13</claim-ref>, wherein ...</claim-text>
    </claim>
  </claims>
</us-patent-grant>

When I execute the PHP simplexml_load_string function on the XML, the <claim-ref idref="CLM-00013">claim 13</claim-ref> part goes away and I'm left with the following for the claim text:

15. The system of , wherein ...

I tried executing the simplexml_load_string function as follows:

$xml = simplexml_load_string($xmlTxt, 'SimpleXMLElement', LIBXML_NOCDATA);

But it didn't change anything.
What do I need to do in order to get the text within the claim-ref tags to be retained as part of the CDATA within the claim-text tags? Please note that I don't need to retain the actual claim-ref tags, just the text within them.


Solution

  • There is no CDATA section in your example XML. A CDATA section looks like this in XML:

    <foo><![CDATA[<bar>text</bar>]]></foo>
    

    The CDATA section is a single text node in this case. It is comparable to:

    <foo>&lt;bar&gt;text&lt;/bar&gt;</foo>
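
    A minimal sketch to check that equivalence, assuming a plain string cast on each SimpleXMLElement (the variable names $cdata and $escaped are only for illustration):

    ```php
    <?php
    // The same text, written once as a CDATA section and once with escaped entities.
    $cdata   = new SimpleXMLElement('<foo><![CDATA[<bar>text</bar>]]></foo>');
    $escaped = new SimpleXMLElement('<foo>&lt;bar&gt;text&lt;/bar&gt;</foo>');

    // Both variants carry the same character data ...
    var_dump((string) $cdata);                        // string(15) "<bar>text</bar>"
    var_dump((string) $cdata === (string) $escaped);  // bool(true)

    // ... and in neither case does <foo> have a <bar> child element.
    var_dump(count($cdata->children()));              // int(0)
    ```

    So a CDATA section never contains elements, only characters that happen to look like markup.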
    

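    Casting a SimpleXMLElement to a string returns only the text directly inside that element, which is why the text of the claim-ref child goes missing. If you need the text content of a SimpleXMLElement (including its descendants), you can convert it into a DOM node with dom_import_simplexml(). The DOMElement::$textContent property provides it.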

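    // $xml holds the XML document as a string ($xmlTxt in the question).
    $patentGrant = new SimpleXMLElement($xml);
    // Import the <claim-text> of the first <claim> as a DOMElement.
    $node = dom_import_simplexml($patentGrant->claims->claim->{'claim-text'});
    
    // textContent concatenates the text of the element and all its descendants.
    var_dump($node->textContent);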
    

    Output:

    string(39) "15. The system of claim 13, wherein ..."
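
    The same idea extends to every claim in the document. The following is a sketch under assumptions: the sample XML is a trimmed, made-up stand-in with the same shape as the patent file (the text of claim 13 is invented), and $texts and $direct are illustrative names. It also shows the plain string cast next to textContent for comparison.

    ```php
    <?php
    // Hypothetical, trimmed-down sample with the same structure as the patent XML.
    $xml = '<us-patent-grant><claims>'
        . '<claim id="CLM-00013" num="00013">'
        . '<claim-text>13. A system comprising ...</claim-text>'
        . '</claim>'
        . '<claim id="CLM-00015" num="00015">'
        . '<claim-text>15. The system of <claim-ref idref="CLM-00013">claim 13</claim-ref>, wherein ...</claim-text>'
        . '</claim>'
        . '</claims></us-patent-grant>';

    $patentGrant = new SimpleXMLElement($xml);

    // Collect the full text of each claim, keyed by its id attribute.
    $texts = [];
    foreach ($patentGrant->claims->claim as $claim) {
        $node = dom_import_simplexml($claim->{'claim-text'});
        $texts[(string) $claim['id']] = $node->textContent;
    }

    // For comparison: the string cast skips the text inside <claim-ref>.
    $direct = (string) $patentGrant->claims->claim[1]->{'claim-text'};

    echo $direct, "\n";              // 15. The system of , wherein ...
    echo $texts['CLM-00015'], "\n";  // 15. The system of claim 13, wherein ...
    ```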
    

    Alternatively, use DOMXPath::evaluate() without SimpleXML at all:

    $document = new DOMDocument();
    $document->loadXML($xml);
    $xpath = new DOMXPath($document);
    
    // string() returns the text of the first matching node, descendants included.
    var_dump($xpath->evaluate('string(/us-patent-grant/claims/claim/claim-text)'));
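
    Note that string() applied to a node set only uses the first node in document order, so the expression above yields a single claim. To read all claims, you could loop over the claim elements and evaluate a relative expression for each one. This is a sketch, not the only way to do it: the sample XML is a made-up stand-in shaped like the patent file, $claims is an illustrative name, and using normalize-space() to collapse whitespace is my own choice here.

    ```php
    <?php
    // Hypothetical, trimmed-down sample with the same structure as the patent XML.
    $xml = '<us-patent-grant><claims>'
        . '<claim id="CLM-00013" num="00013">'
        . '<claim-text>13. A system comprising ...</claim-text>'
        . '</claim>'
        . '<claim id="CLM-00015" num="00015">'
        . '<claim-text>15. The system of <claim-ref idref="CLM-00013">claim 13</claim-ref>, wherein ...</claim-text>'
        . '</claim>'
        . '</claims></us-patent-grant>';

    $document = new DOMDocument();
    $document->loadXML($xml);
    $xpath = new DOMXPath($document);

    $claims = [];
    // A location path returns a DOMNodeList, which can be iterated directly.
    foreach ($xpath->evaluate('/us-patent-grant/claims/claim') as $claim) {
        // The second argument makes the expression relative to the current <claim>.
        $id   = $xpath->evaluate('string(@id)', $claim);
        $text = $xpath->evaluate('normalize-space(claim-text)', $claim);
        $claims[$id] = $text;
    }

    print_r($claims);
    ```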