I have a bunch of text/html documents I'm processing. Some of them contain encoded HTML entities, which I'm trying to convert into their raw decoded UTF-8 characters. This is easy using html_entity_decode; however, some of the entities are invalid, such as &#x99999;.
For this reason I'm using a regexp to pull out every individual entity and then trying to validate each one somehow. If an entity is invalid, I want to leave it as &#x99999; in the document, but things like an encoded &amp; should still become &.
Here is some sample test code I knocked up:
<?php
function dump_chars($s)
{
    if (preg_match_all('/&[#A-Za-z0-9]+;/', $s, $matches))
    {
        foreach ($matches[0] as $m)
        {
            $decoded = html_entity_decode($m, ENT_QUOTES, "UTF-8");
            echo "[" . htmlentities($m, ENT_QUOTES, "UTF-8") . "] ";
            echo "Decoded: [" . $decoded . "] ";
            echo "Hex: [" . bin2hex($decoded) . "] ";
            echo "detect: [" . mb_detect_encoding($decoded) . "]";
            echo "<br>";
        }
    }
}
// Entities for ", &, U+0349, U+2019 and the invalid U+99999
$payload = "&quot; &amp; &#x349; &#x2019; &#x99999;";
echo "<html><head><meta charset='UTF-8'></head><body>";
dump_chars($payload);
I'm drawing a bit of a blank on how best to validate the entity, and would love some help, please.
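One way to approach the validation is to parse the code point out of each numeric entity and check it against an allowed range. The following is only a minimal sketch: the helper names entity_code_point and is_acceptable_code_point are hypothetical, and the upper bound of U+2FFFF and the exclusion of surrogates are assumptions about what should count as "valid", not rules taken from any specification.

```php
<?php
// Hypothetical helper: return the code point of a numeric entity such as
// "&#8217;" or "&#x2019;", or null if $entity is not a numeric entity.
function entity_code_point(string $entity): ?int
{
    // Hexadecimal form: &#x...;
    if (preg_match('/^&#[xX]([0-9A-Fa-f]{1,6});$/', $entity, $m)) {
        return (int) hexdec($m[1]);
    }
    // Decimal form: &#...;
    if (preg_match('/^&#([0-9]{1,7});$/', $entity, $m)) {
        return (int) $m[1];
    }
    return null;
}

// Hypothetical rule (an assumption): accept code points up to $max,
// excluding the UTF-16 surrogate range U+D800..U+DFFF.
function is_acceptable_code_point(int $cp, int $max = 0x2FFFF): bool
{
    $isSurrogate = $cp >= 0xD800 && $cp <= 0xDFFF;
    return $cp >= 0 && $cp <= $max && !$isSurrogate;
}
```

With these two helpers, each match from the regexp can be decoded only when entity_code_point() returns a value that is_acceptable_code_point() accepts, and left as-is otherwise.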
I eventually found a way:
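function decode_numeric_entities($s)
{
    $result = $s;
    // convmap is [start, end, offset, mask]: only code points from
    // U+0000 to U+2FFFF are decoded, anything outside is left untouched.
    $convmap = array(0x0, 0x2FFFF, 0, 0xFFFF);
    if (preg_match_all('/&[#A-Za-z0-9]+;/', $s, $matches))
    {
        foreach ($matches[0] as $m)
        {
            // Decode one matched entity and substitute it into the result.
            $decoded = mb_decode_numericentity($m, $convmap, 'UTF-8');
            $result = str_replace($m, $decoded, $result);
        }
    }
    return $result;
}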
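Running a string through this function converts every numeric entity whose code point falls inside the convmap range (U+0000 to U+2FFFF) to its actual UTF-8 character and leaves the ones outside that range as entities. Note that mb_decode_numericentity only handles numeric entities, so named ones such as &amp; pass through unchanged.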
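If named entities should be decoded as well, the two decoders can be combined in a single pass. This is a rough sketch rather than a definitive implementation: decode_entities_safely and $maxCodePoint are made-up names, the convmap mirrors the one above, and whether hexadecimal entities get decoded depends on how mb_decode_numericentity behaves in the PHP version in use.

```php
<?php
// Sketch: decode named entities and in-range numeric entities in one pass.
// decode_entities_safely() and $maxCodePoint are illustrative names.
function decode_entities_safely(string $s, int $maxCodePoint = 0x2FFFF): string
{
    // convmap layout: [start, end, offset, mask]
    $convmap = [0x0, $maxCodePoint, 0, 0xFFFF];

    $result = preg_replace_callback(
        '/&(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);/',
        function (array $m) use ($convmap): string {
            if ($m[1][0] === '#') {
                // Numeric entity: stays as-is when outside the convmap range.
                return mb_decode_numericentity($m[0], $convmap, 'UTF-8');
            }
            // Named entity: unknown names are returned unchanged.
            return html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
        },
        $s
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $s;
}
```

For example, decode_entities_safely('&quot; &amp; &#8217; &#629145;') should decode the first three entities and keep &#629145; (U+99999) as an entity. Because preg_replace_callback makes a single pass, text produced by one replacement is not decoded a second time.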