I found a freely available data dump of USPTO patent data in XML format. Part of the XML for most of the patents has the following structure:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v45-2014-04-03.dtd" [ ]>
<us-patent-grant lang="EN" dtd-version="v4.5 2014-04-03" file="US09226443-20160105.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20151221" date-publ="20160105">
...
<claims>
...
<claim id="CLM-00015" num="00015">
<claim-text>15. The system of <claim-ref idref="CLM-00013">claim 13</claim-ref>, wherein ...</claim-text>
</claim>
</claims>
</us-patent-grant>
When I execute the PHP simplexml_load_string function on the XML, the <claim-ref idref="CLM-00013">claim 13</claim-ref> part goes away and I'm left with the following for the claim text:
15. The system of , wherein ...
I tried executing the simplexml_load_string function as follows:
$xml = simplexml_load_string($xmlTxt, 'SimpleXMLElement', LIBXML_NOCDATA);
But it didn't change anything.
What do I need to do in order to get the text within the claim-ref tags to be retained as part of the CDATA within the claim-text tags? Please note that I don't need to retain the actual claim-ref tags, just the text within them.
There is no CDATA section in your example XML, which is why LIBXML_NOCDATA changes nothing. A CDATA section looks like this in XML:
<foo><![CDATA[<bar>text</bar>]]></foo>
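The CDATA section is a single text node in this case. It is comparable to:
<foo>&lt;bar&gt;text&lt;/bar&gt;</foo>
In your XML, claim-ref is a real child element of claim-text, so its text lives in a separate node below claim-text.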
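To make the difference concrete, here is a minimal sketch that loads both variants with SimpleXML. The variable names are only for illustration; the point is that markup inside a CDATA section is plain character data, while a real child element is a separate node that a string cast of the parent does not include.

```php
<?php
// With a CDATA section, "<bar>text</bar>" is just character data of <foo>.
$withCdata = simplexml_load_string('<foo><![CDATA[<bar>text</bar>]]></foo>');
echo (string) $withCdata, "\n";        // the literal text <bar>text</bar>

// Without CDATA, <bar> is a child element; <foo> has no text of its own.
$withElement = simplexml_load_string('<foo><bar>text</bar></foo>');
var_dump((string) $withElement);       // empty string
echo (string) $withElement->bar, "\n"; // text
```

This is the same situation as claim-text and claim-ref in the question: casting claim-text to a string only returns its direct text nodes.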
If you need the text content of a SimpleXMLElement (including its descendants), you can convert it into a DOM node with dom_import_simplexml(). The DOMElement::$textContent property then returns the combined text.
$patentGrant = new SimpleXMLElement($xml);
$node = dom_import_simplexml($patentGrant->claims->claim->{'claim-text'});
var_dump($node->textContent);
Output:
string(39) "15. The system of claim 13, wherein ..."
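The snippet above only reads the first claim. If you need every claim, the same idea can be applied in a loop. The following is a sketch under assumptions: the XML is a shortened, made-up sample in the shape of the question's data (the text of claim 13 is invented), and keying the result by the id attribute is just one possible choice.

```php
<?php
// Shortened, made-up sample shaped like the patent XML in the question.
$xml = <<<'XML'
<us-patent-grant>
  <claims>
    <claim id="CLM-00013" num="00013">
      <claim-text>13. A system comprising ...</claim-text>
    </claim>
    <claim id="CLM-00015" num="00015">
      <claim-text>15. The system of <claim-ref idref="CLM-00013">claim 13</claim-ref>, wherein ...</claim-text>
    </claim>
  </claims>
</us-patent-grant>
XML;

$patentGrant = new SimpleXMLElement($xml);

$claimTexts = [];
foreach ($patentGrant->claims->claim as $claim) {
    // Import the SimpleXML element into DOM to read text of all descendants.
    $node = dom_import_simplexml($claim->{'claim-text'});
    $claimTexts[(string) $claim['id']] = $node->textContent;
}

print_r($claimTexts);
```

Because textContent concatenates all descendant text nodes, this should also cope with claim-text elements that contain further nested elements.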
Alternatively, use DOMXPath::evaluate() directly, without SimpleXML at all:
$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);
// string() returns the text content of the first matching node
var_dump($xpath->evaluate('string(/us-patent-grant/claims/claim/claim-text)'));
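Note that string() only converts the first matching node, so the expression above returns a single claim. A possible way to collect all claims with XPath is to select the claim elements and evaluate relative expressions against each one. This is a sketch with a shortened, made-up sample document (the text of claim 13 is invented), not the full patent file.

```php
<?php
// Shortened, made-up sample shaped like the patent XML in the question.
$xml = <<<'XML'
<us-patent-grant>
  <claims>
    <claim id="CLM-00013" num="00013">
      <claim-text>13. A system comprising ...</claim-text>
    </claim>
    <claim id="CLM-00015" num="00015">
      <claim-text>15. The system of <claim-ref idref="CLM-00013">claim 13</claim-ref>, wherein ...</claim-text>
    </claim>
  </claims>
</us-patent-grant>
XML;

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

$claimTexts = [];
// Select every claim, then evaluate expressions relative to that claim node.
foreach ($xpath->evaluate('/us-patent-grant/claims/claim') as $claim) {
    $id = $xpath->evaluate('string(@id)', $claim);
    $claimTexts[$id] = $xpath->evaluate('string(claim-text)', $claim);
}

print_r($claimTexts);
```

The second argument of evaluate() is the context node, which is what lets the short relative expressions `@id` and `claim-text` work inside the loop.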