
Remove Processing Instruction (<?xml tags and content) from XML String


I have this tag in a string:

<?xml:namespace prefix = o /?>

How do I remove that and similar tags from the string with PHP and regex?

I tried:

$clean = preg_replace('/<\?xml[^>]+\/>/im', '', $dirty);

Solution

  • What you have in that string is a Processing Instruction (PI, see XML 1.0).

    If you want to remove those PIs from a string that you expect to be UTF-8 encoded, without using the PCRE UTF-8 modifier (u), you can use the following pattern:

    ~
        <\?
        (?: [A-Za-z_:] | [^\x00-\x7F] ) (?: [A-Za-z0-9_:.-] | [^\x00-\x7F] )*
        (?: \?> | \s (?: [^?]* \?+ ) (?: [^>?] [^?]* \?+ )* >)
    ~x
    

    It is a translation from a REX expression for XML Processing Instructions to a PCRE expression as used in PHP.

    A code example:

    $str = "some string <?xml:namespace prefix = o /?> that is";
    
    $pattern = '~
        <\?
        (?: [A-Za-z_:] | [^\x00-\x7F] ) (?: [A-Za-z0-9_:.-] | [^\x00-\x7F] )*
        (?: \?> | \s (?: [^?]* \?+ ) (?: [^>?] [^?]* \?+ )* >)
    ~x';
    
    echo preg_replace($pattern, '', $str);
    

    The output:

    some string  that is
    

    In contrast to the previous answer, this regular expression ...

    • ... takes the closing sequence ("?>") correctly into account. In particular, a lone ">" is allowed inside a processing instruction.
    • ... does not require the name of the processing instruction to start with "xml".
    • ... actually looks for a name as part of the opening sequence.
    • ... deals with both empty and non-empty processing instructions.
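
The points above can be checked against a few made-up inputs. This is only a minimal sketch: the helper name strip_pis and the sample strings are assumptions for illustration, and the name part of the pattern here accepts digits after the first character, as XML names do.

```php
<?php
// Hypothetical helper wrapping the pattern from this answer.
function strip_pis(string $str): string
{
    $pattern = '~
        <\?
        (?: [A-Za-z_:] | [^\x00-\x7F] ) (?: [A-Za-z0-9_:.-] | [^\x00-\x7F] )*
        (?: \?> | \s (?: [^?]* \?+ ) (?: [^>?] [^?]* \?+ )* >)
    ~x';

    return preg_replace($pattern, '', $str);
}

// Made-up sample strings, roughly one per point in the list above.
$samples = [
    'a<?check x > y ?>b',               // a ">" inside the PI content
    'a<?php echo 1; ?>b',               // name does not start with "xml"
    'a<? x ?>b',                        // no name after "<?": left untouched
    'a<?empty?>b',                      // empty PI
    'a<?xml:namespace prefix = o /?>b', // the PI from the question
];

foreach ($samples as $sample) {
    echo strip_pis($sample), "\n";
}
```

Only the third sample is left unchanged, because nothing that qualifies as a name follows the opening "<?".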

    Some limitations worth mentioning:

    1. The pattern is intended for shallow parsing. If you have not yet stripped other constructs that can contain text which looks like a processing instruction (e.g. a CDATA section or a comment), the pattern will produce false matches there.
    2. The pattern also matches an XML declaration, which starts with "<?xml". This can be avoided by excluding the reserved name "xml" right after the opening "<?" with a negative lookahead like "(?! [xX][mM][lL] (?: \?> | \s ) )".
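
The negative lookahead from the second note could be placed directly after the opening "<?", roughly as sketched here. The function name strip_pis_keep_declaration and the sample document are assumptions made for this illustration.

```php
<?php
// Hypothetical variant: strips PIs but leaves an XML declaration in place.
function strip_pis_keep_declaration(string $str): string
{
    $pattern = '~
        <\?
        (?! [xX][mM][lL] (?: \?> | \s ) )   # skip the reserved name "xml" on its own
        (?: [A-Za-z_:] | [^\x00-\x7F] ) (?: [A-Za-z0-9_:.-] | [^\x00-\x7F] )*
        (?: \?> | \s (?: [^?]* \?+ ) (?: [^>?] [^?]* \?+ )* >)
    ~x';

    return preg_replace($pattern, '', $str);
}

// Made-up sample: a declaration followed by two ordinary PIs.
$xml = '<?xml version="1.0"?><r><?xml-stylesheet href="a.xsl"?><?php echo 1; ?></r>';

echo strip_pis_keep_declaration($xml), "\n";
```

Names that merely begin with "xml", such as "xml-stylesheet", are still removed, because the lookahead only rejects "xml" when it is followed by whitespace or "?>".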

    Because of these limitations, it is perhaps worth considering

    Alternatives to Regular Expressions

    First of all, it can be much easier to just use PHP's strip_tags to strip the processing instructions. It also removes other tags and comments, which is not always wanted, but it is really straightforward:

    strip_tags($str)
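
To make the side effect concrete, here is a small sketch with a made-up input string that contains an element and a comment in addition to the processing instruction from the question.

```php
<?php
// Made-up input: an element, a comment and the PI from the question.
$dirty = 'some <b>bold</b><!-- note --> string <?xml:namespace prefix = o /?> that is';

// strip_tags() drops the PI, but also the <b> tags and the comment.
$clean = strip_tags($dirty);

echo $clean, "\n";
```

The text inside the removed element ("bold") stays; only the markup itself is dropped.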
    

    More explicit than both the regular expression and strip_tags is to use one of the XML parsers that ship with PHP to strip the processing instructions, for example PHP's DOM extension. It can be wrapped in a function so it is easily applied to a string:

    dom_strip_pis($str)
    

    Such a function also works with the XML string you have, which uses the reserved name "xml" as a prefix. That is actually not correct in XML, but the parser won't choke on it:

    /**
     * Remove top-level processing instructions from an XML string.
     *
     * @author hakre <http://hakre.wordpress.com>
     *
     * @param string $str
     * @return string
     */
    function dom_strip_pis($str) {
        $doc = new DOMDocument;
        $fragment = $doc->createDocumentFragment();
        // Collect libxml warnings internally instead of emitting them.
        $saved = libxml_use_internal_errors(true);
        $fragment->appendXML($str);
        libxml_use_internal_errors($saved);
        // Iterate over a copy: childNodes is a live list, so removing
        // nodes while looping over it directly can skip siblings.
        foreach (iterator_to_array($fragment->childNodes) as $node) {
            if ($node instanceof DOMProcessingInstruction) {
                $fragment->removeChild($node);
            }
        }
        return $doc->saveXML($fragment);
    }
    

    Using an XML parser as in the last example spares you from having to deal with shallow parsing.
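
The function above only inspects the direct children of the fragment. If the input is a complete, well-formed XML document rather than a fragment (an assumption that does not hold for the string in the question), processing instructions at any depth could be removed with XPath. The function name dom_strip_all_pis is hypothetical, and this is a sketch rather than a drop-in replacement.

```php
<?php
// Hypothetical variant for whole documents: removes PIs at every nesting level.
function dom_strip_all_pis(string $xml): string
{
    $doc = new DOMDocument;

    // Collect libxml warnings internally instead of emitting them.
    $saved  = libxml_use_internal_errors(true);
    $loaded = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($saved);

    if (!$loaded) {
        throw new InvalidArgumentException('Input is not a well-formed XML document.');
    }

    // The XPath result is a snapshot, so nodes can be removed while iterating.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//processing-instruction()') as $pi) {
        $pi->parentNode->removeChild($pi);
    }

    // Serialize only the root element, without the XML declaration.
    return $doc->saveXML($doc->documentElement);
}

// Made-up sample with one nested and one top-level-in-root PI.
echo dom_strip_all_pis('<root><?foo bar?><a><?baz?>text</a></root>'), "\n";
```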