php html-entities iconv

How to detect invalid html entities in PHP?


I have a bunch of text/html documents that I'm processing.

Some of them contain encoded HTML entities, which I'm trying to convert into their raw decoded UTF-8 characters.

This is easy using html_entity_decode; however, some of the entities are invalid, such as:

&#x99999;

For this reason I'm using a regexp to pull out each individual entity and then trying to validate it somehow.

If an entity is invalid, I want to leave it as &#x99999; in the document, but things like an encoded &amp; should still become &.

Here is some sample test code I knocked up:

<?php
function dump_chars($s)
{
    if (preg_match_all('/&[#A-Za-z0-9]+;/', $s, $matches))
    {
        foreach ($matches[0] as $m)
        {
            $decoded = html_entity_decode($m, ENT_QUOTES, "UTF-8");

            echo "[" . htmlentities($m, ENT_QUOTES, "UTF-8") . "] ";
            echo "Decoded: [" . $decoded . "] ";
            echo "Hex: [" . bin2hex($decoded) . "] "; 
            echo "detect: [" . mb_detect_encoding($decoded) . "]";
            echo "<br>";
        }
    }
}

// Mix of named, numeric, and suspect entities to inspect.
$payload = "&quot; &amp; &#x349; &#x92; &#x99999;";
echo "<html><head><meta charset='UTF-8'></head><body>";
dump_chars($payload);
echo "</body></html>";

I'm drawing a blank on how best to validate each entity and would love some help.


Solution

  • I eventually found a way:

    function decode_numeric_entities($s)
    {
        $result = $s;
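        // Conversion map: (start, end, offset, mask). Only numeric
        // references with code points in 0x0..0x2FFFF are decoded.
        $convmap = array(0x0, 0x2FFFF, 0, 0xFFFF);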
    
        if (preg_match_all('/&[#A-Za-z0-9]+;/', $s, $matches))
        {
            foreach ($matches[0] as $m)
            {
                $decoded = mb_decode_numericentity($m, $convmap, 'UTF-8');
                $result = str_replace($m, $decoded, $result);
            }
        }
        return $result;
    }
    

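    Running a string through this function converts every numeric entity whose code point falls inside the conversion map's range (0x0 to 0x2FFFF here) into its UTF-8 character and leaves the others, such as &#x99999;, as entities. Named entities like &amp; are not numeric references, so mb_decode_numericentity leaves them untouched; they still need a pass through html_entity_decode.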
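
The same idea can be sketched as a single pass with preg_replace_callback, which also decodes named entities such as &amp; along the way. This is only a minimal sketch, not the answer's own method: the function name decode_entities_safely and the $maxCodePoint parameter are hypothetical, and the 0x2FFFF cutoff is simply carried over from the conversion map above as an assumed definition of "valid".

```php
<?php
// Hypothetical helper (not from the original answer): decode entities one
// at a time, leaving numeric references above $maxCodePoint untouched.
// The 0x2FFFF default is an assumption borrowed from the convmap above.
function decode_entities_safely(string $s, int $maxCodePoint = 0x2FFFF): string
{
    // Group 1 captures hex digits, group 2 captures decimal digits;
    // named entities match the last alternative and capture nothing.
    $pattern = '/&(?:#(?:[xX]([0-9A-Fa-f]+)|([0-9]+))|[A-Za-z][A-Za-z0-9]*);/';

    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($maxCodePoint): string {
            $hex = $m[1] ?? '';
            $dec = $m[2] ?? '';

            if ($hex !== '' || $dec !== '') {
                // Numeric reference: work out the code point it names.
                $codePoint = $hex !== '' ? hexdec($hex) : (int) $dec;
                if ($codePoint > $maxCodePoint) {
                    return $m[0]; // out of range: keep the entity as written
                }
            }

            // In-range numeric or named entity: let PHP decode it.
            // Entities PHP does not recognise come back unchanged.
            return html_entity_decode($m[0], ENT_QUOTES, 'UTF-8');
        },
        $s
    );

    // preg_replace_callback returns null on a PCRE error; fall back to input.
    return $result ?? $s;
}
```

Called on a string like "&quot; &amp; &#x99999;", this should decode the first two entities and keep &#x99999; as it is. Because each match is replaced where it was found, decoded text is never scanned again, which the str_replace loop cannot guarantee.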