How to detect invalid HTML entities in PHP?

I have a bunch of text/html documents I'm processing.

Some of them contain encoded HTML entities, which I'm trying to convert into their raw decoded UTF-8 characters.

This is easy using html_entity_decode. However, some of the entities are invalid, such as:

&#x99999;

For this reason I'm using a regexp to pull out every individual entity, and then trying to validate them somehow.

If an entity is invalid, I want to leave it as &#x99999; in the document, but things like &amp; would still become &.

Here's some sample test code I knocked up:

<?php
function dump_chars($s)
{
    if (preg_match_all('/&[#A-Za-z0-9]+;/', $s, $matches))
    {
        foreach ($matches[0] as $m)
        {
            $decoded = html_entity_decode($m, ENT_QUOTES, "UTF-8");

            echo "[" . htmlentities($m, ENT_QUOTES, "UTF-8") . "] ";
            echo "Decoded: [" . $decoded . "] ";
            echo "Hex: [" . bin2hex($decoded) . "] "; 
            echo "detect: [" . mb_detect_encoding($decoded) . "]";
            echo "<br>";
        }
    }
}

$payload = "&quot; &amp; &#x349; &#x92; &#x99999;";
echo "<html><head><meta charset='UTF-8'></head><body>";
dump_chars($payload);
echo "</body></html>";

I'm drawing a bit of a blank on how best to validate each entity. I'd love some help, please.

Solution

I eventually found a way:

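function decode_numeric_entities($s)
{
    $result = $s;
    // Conversion map: array(start code point, end code point, offset, mask).
    // Only numeric entities in the range 0x0 to 0x2FFFF are decoded.
    $convmap = array(0x0, 0x2FFFF, 0, 0xFFFF);

    // Pull out each candidate entity and decode it on its own.
    if (preg_match_all('/&[#A-Za-z0-9]+;/', $s, $matches))
    {
        foreach ($matches[0] as $m)
        {
            // Entities outside the map come back unchanged.
            $decoded = mb_decode_numericentity($m, $convmap, 'UTF-8');
            $result = str_replace($m, $decoded, $result);
        }
    }
    return $result;
}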

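Running a string through this function converts every numeric entity whose code point falls inside the convmap range (0x0 to 0x2FFFF) to its actual UTF-8 character, and leaves out-of-range ones such as &#x99999; as entities. Named entities such as &amp; are not numeric, so mb_decode_numericentity leaves them alone as well.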
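
For what it's worth, if named entities should be decoded in the same pass, something along these lines might work. It's only a sketch: decode_valid_entities and its $maxCodePoint parameter are names made up for illustration, the 0x2FFFF cap simply mirrors the convmap above, and the validity rule (non-zero, not a surrogate, at or below the cap) is an assumption you may want to tighten. It needs PHP 7.2 or later for mb_chr.

```php
<?php
// Hypothetical helper, not part of the original answer: decode each entity
// individually and keep any entity that fails validation as literal text.
function decode_valid_entities(string $s, int $maxCodePoint = 0x2FFFF): string
{
    // Group 1: decimal digits, group 2: hex digits, group 3: entity name.
    $pattern = '/&(?:#([0-9]{1,7})|#[xX]([0-9A-Fa-f]{1,6})|([A-Za-z][A-Za-z0-9]*));/';

    $callback = function (array $m) use ($maxCodePoint): string {
        if (($m[3] ?? '') !== '') {
            // Named entity: html_entity_decode returns unknown names unchanged.
            return html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
        }

        // Numeric entity: work out the code point it refers to.
        $cp = ($m[1] !== '') ? (int) $m[1] : (int) hexdec($m[2]);

        // Assumed validity rule: non-zero, not a surrogate, within the cap.
        $isSurrogate = ($cp >= 0xD800 && $cp <= 0xDFFF);
        if ($cp === 0 || $isSurrogate || $cp > $maxCodePoint) {
            return $m[0]; // leave the entity exactly as it was
        }

        $char = mb_chr($cp, 'UTF-8');
        return ($char === false) ? $m[0] : $char;
    };

    // preg_replace_callback returns null on a regex error; fall back to the input.
    return preg_replace_callback($pattern, $callback, $s) ?? $s;
}
```

Called on a string like '&quot; &amp; &#x349; &#x99999;', this decodes the first three entities and leaves &#x99999; in place. One limitation of the pattern as written: numeric entities with more than 7 decimal or 6 hex digits are not matched at all, so they also stay untouched.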