
Premature end of data reading an XML from an API


I receive an XML feed from an API supplied by a company.

When I try to read it, most products import just fine, except this one:

<?xml version="1.0" encoding="utf-8"?>
<pfeed lastaccess="31-12-2010 00:00:00">
<p>
    <p_descs lastmodified="1-4-2022 05:28:25">
        <p_desc_std_N lastmodified="31-3-2022 00:17:37">
            <![CDATA[Test product]]>
        </p_desc_std_N>
        <p_desc_ext_N lastmodified="31-3-2022 00:21:31">
            <![CDATA[<h3>Test product</h3><div class="eq items-block with-gutter items-50-50-100" data-minwidth="" data-maxwidth=""><div class="item item-2"><div class="item-stylable">Lorem ipsum</div></div></div><p><strong>Dolor</strong> justo ultricies vehicula<br /></p>]]>
        </p_desc_ext_N>
    </p_descs>
</p>
</pfeed>

The problem seems to be that each product is wrapped in a <p>...</p> element, and the CDATA section also contains an HTML <p>...</p>.
As a result, the import for that product fails and returns:

simplexml_load_string(): Entity: line 10: parser error : Premature end of data in tag p line 2

If I remove the HTML paragraph tags, it runs smoothly.

I'm parsing the XML with XmlStringStreamer because the file is 500+ MB.
Each node is then read with a plain simplexml_load_string() call:

// $node will be a string like this: "<customer><firstName>Jane</firstName><lastName>Doe</lastName></customer>"

            try {
                $simpleXmlNode = simplexml_load_string( $node );
            } catch (\Exception $e) {
                echo $e->getMessage();
                dd($node);
            }

Is there a way to ignore the HTML in the CDATA section?
Running the XML through an online parser throws no error, so the XML itself is valid, and those parsers read it without a problem. I therefore think this is a minor issue and I'm overlooking the solution.

Thanks in advance.

I tried several approaches, such as replacing all HTML in the CDATA sections using htmlentities or removing the paragraph tags completely, but all of these attempts failed.


Solution

  • This sounds like a problem with the XmlStringStreamer part. Maybe you need to use it differently.

    The "standard" tool for reading large XML files is XMLReader. You can use it to iterate the p nodes and expand them into DOM. DOM can be imported into SimpleXML.

    $reader = new XMLReader();
    $reader->open(getXMLUri());
    
    // bootstrap DOM for expanded nodes
    $document = new DOMDocument();
    $xpath = new DOMXPath($document);
    
    while ($reader->read() && $reader->localName !== 'p') {
      continue;
    }
    
    while ($reader->localName === 'p') {
      // expand to DOM - this will load the current node and all descendants
      $p = $reader->expand($document);
      // use xpath to access values ...
      var_dump($xpath->evaluate('string(p_descs/@lastmodified)', $p));
      // ... or nodes ...
      foreach ($xpath->evaluate('p_descs/*', $p) as $node) {
          var_dump($node->localName, $node->textContent);
      }
      // ... or import to SimpleXML
      $pElement = simplexml_import_dom($p);
      
      // go to following "p" sibling node
      $reader->next('p');
    }
    $reader->close();
    
    function getXMLUri() {
        $data = <<<'XML'
    <?xml version="1.0" encoding="utf-8"?>
    <pfeed lastaccess="31-12-2010 00:00:00">
    <p>
        <p_descs lastmodified="1-4-2022 05:28:25">
            <p_desc_std_N lastmodified="31-3-2022 00:17:37">
                <![CDATA[Test product]]>
            </p_desc_std_N>
            <p_desc_ext_N lastmodified="31-3-2022 00:21:31">
                <![CDATA[<h3>Test product</h3><div class="eq items-block with-gutter items-50-50-100" data-minwidth="" data-maxwidth=""><div class="item item-2"><div class="item-stylable">Lorem ipsum</div></div></div><p><strong>Dolor</strong> justo ultricies vehicula<br /></p>]]>
            </p_desc_ext_N>
        </p_descs>
    </p>
    </pfeed>
    XML;
    return 'data://text/xml;base64,'.base64_encode($data);
    }
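
To check whether the parser or the splitting step is at fault, it may help to compare a complete node with a truncated one. The following is only a minimal sketch: the shortened $node string is a made-up stand-in for the feed, and cutting at the first "</p>" is an assumption about how a string-based splitter could go wrong, not a statement about what XmlStringStreamer actually does. It uses libxml_use_internal_errors() so that parse errors are collected instead of raised as warnings.

```php
<?php
// Hypothetical, shortened stand-in for one product node from the feed.
// The CDATA section contains HTML with its own <p>...</p>.
$node = '<p><p_descs><p_desc_ext_N>'
    . '<![CDATA[<p><strong>Dolor</strong> justo<br /></p>]]>'
    . '</p_desc_ext_N></p_descs></p>';

// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// The complete node parses: CDATA content is plain text to the parser.
$complete = simplexml_load_string($node);
var_dump($complete !== false);
var_dump(trim((string) $complete->p_descs->p_desc_ext_N));

// Assumption: a naive splitter treats the first "</p>" as the end of the node,
// which here is the HTML closing tag inside the CDATA section.
$cut = strpos($node, '</p>') + strlen('</p>');
$truncated = substr($node, 0, $cut);

// The truncated chunk is not well-formed, so loading it fails.
$broken = simplexml_load_string($truncated);
var_dump($broken === false);

// Print whatever libxml reported for the failed parse.
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), PHP_EOL;
}
libxml_clear_errors();
```

If dumping the failing $node from the question shows a chunk that ends at the HTML "</p>" rather than at the product's closing tag, that would point to the splitting step and not to SimpleXML. In that case, reading the nodes with XMLReader as shown above avoids the issue, because it parses the document instead of searching the raw string.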