Convert HTML entities in JSON back to characters


Problem and original data

I have JSON data that uses HTML entities to encode some special characters (mostly French ones, such as “é”, “ç”, “à”) as well as HTML tags. Here is a sample of my JSON data:

{
    "data1": "&lt;p&gt;Le cartulaire de 1380-1381 copi&amp;eacute; au XVIIe si&amp;egrave;cle et aujourd&amp;rsquo;hui perdu&lt;strong&gt;*&lt;/strong&gt;.",
    "data2": "&lt;p&gt;&lt;strong&gt;*&lt;/strong&gt; Joseph CUVELIER, &lt;em&gt;Cartulaire de l&amp;rsquo;abbaye du Val-Beno&amp;icirc;t&lt;/em&gt;, Bruxelles, 1906, p. XI-XXVII.&lt;/p&gt;"
}

Desired result

{
    "data1": "<p>Le cartulaire de 1380-1381 copié au XVIIe siècle et aujourd’hui perdu<strong>*</strong>.",
    "data2": "<p><strong>*</strong> Joseph CUVELIER, <em>Cartulaire de l’abbaye du Val-Benoît</em>, Bruxelles, 1906, p. XI-XXVII.</p>"
}

So I simply want to decode all HTML entities back to their respective characters and tags. I am trying to do this with PHP.

Here is my current code:

/* decode data */

$jsonData = '{
        "data1": "&lt;p&gt;Le cartulaire de 1380-1381 copi&amp;eacute; au XVIIe si&amp;egrave;cle et aujourd&amp;rsquo;hui perdu&lt;strong&gt;*&lt;/strong&gt;.",
        "data2": "&lt;p&gt;&lt;strong&gt;*&lt;/strong&gt; Joseph CUVELIER, &lt;em&gt;Cartulaire de l&amp;rsquo;abbaye du Val-Beno&amp;icirc;t&lt;/em&gt;, Bruxelles, 1906, p. XI-XXVII.&lt;/p&gt;"
    }';
$data = json_decode($jsonData, true);

/* change html entities and re-encode data */

$data = mb_convert_encoding($data, "UTF-8", "HTML-ENTITIES");
header('Content-Type: application/json; Charset="UTF-8"');
echo json_encode($data, JSON_UNESCAPED_UNICODE|JSON_UNESCAPED_SLASHES);

My current result:

{
    "data1": "<p>Le cartulaire de 1380-1381 copi&eacute; au XVIIe si&egrave;cle et aujourd&rsquo;hui perdu<strong>*</strong>.",
    "data2": "<p><strong>*</strong> Joseph CUVELIER, <em>Cartulaire de l&rsquo;abbaye du Val-Beno&icirc;t</em>, Bruxelles, 1906, p. XI-XXVII.</p>"
}

So the HTML tags were converted correctly, but the entities for the French special characters are still there: instead of, for example, &amp;eacute; I now have &eacute;.

Question: how can I convert these HTML entities back to characters?
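The half-decoded result is consistent with the text being encoded twice: the "&" of each named entity was itself escaped as &amp;, so a single decoding pass only exposes the entity. A minimal sketch of that behaviour, using a shortened sample string and variable names ($raw, $once, $twice) that are assumptions for illustration:

```php
<?php
// Shortened sample value, as it looks after json_decode().
// The "&" of "&eacute;" was escaped a second time as "&amp;".
$raw = '&lt;p&gt;copi&amp;eacute; au XVIIe si&amp;egrave;cle&lt;/p&gt;';

// First pass: &lt; -> <, &gt; -> >, &amp; -> & (which exposes "&eacute;").
$once = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Second pass: the exposed named entities become real characters.
$twice = html_entity_decode($once, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $once, PHP_EOL;   // <p>copi&eacute; au XVIIe si&egrave;cle</p>
echo $twice, PHP_EOL;  // <p>copié au XVIIe siècle</p>
```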

You can test it online here: https://www.tehplayground.com/Z4uB5KIPPo4UQ4h1

Many thanks in advance!

UPDATE:

It turns out that my data is more complex than I imagined. Within the same data, some characters were preserved as “é”, “à”, “ç”, etc., while others were converted to HTML entities. So I can have something like this:

{
    "someData1":
    {
        "data1":
        [
            "ecclésiastique"
        ],
        "data2": "s&amp;eacute;culiers"
    },
    "someData2":
    [
        {
            "anotherData1": "ecclésiastique",
            "anotherData2": "&lt;p&gt;Le cartulaire de 1380-1381 copi&amp;eacute; au XVIIe si&amp;egrave;cle et aujourd&amp;rsquo;hui perdu&lt;strong&gt;*&lt;/strong&gt;.",
            "anotherData3":
            {
                "text1": "texte here",
                "text2": "texte here"
            }
        },
        {
            "anotherData1": "ecclésiastique",
            "anotherData2": "&lt;p&gt;Le cartulaire de 1380-1381 copi&amp;eacute; au XVIIe si&amp;egrave;cle et aujourd&amp;rsquo;hui perdu&lt;strong&gt;*&lt;/strong&gt;.",
            "anotherData3":
            {
                "text1": "texte here",
                "text2": "texte here"
            }
        }
    ]
}

So, I suppose I have to 1) Convert all data to HTML entities; 2) Convert all HTML entities back to characters…
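Re-encoding everything first may not be necessary: html_entity_decode() only touches "&...;" sequences, so values that already contain literal characters pass through unchanged. A small sketch under that assumption, with hypothetical variable names and assuming the script file itself is saved as UTF-8:

```php
<?php
$alreadyDecoded = 'ecclésiastique';        // literal UTF-8 characters, no entities
$doubleEncoded  = 's&amp;eacute;culiers';  // entity whose "&" was escaped again

// No "&...;" sequence in the string, so decoding is a no-op.
$a = html_entity_decode($alreadyDecoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// The double-encoded value needs two passes: &amp; -> &, then &eacute; -> é.
$b = html_entity_decode($doubleEncoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');
$b = html_entity_decode($b, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $a, PHP_EOL;  // ecclésiastique
echo $b, PHP_EOL;  // séculiers
```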

Here is my current code:

# Get data

$jsonData = '{
    "someData1":
    {
        "data1":
        [
            "ecclésiastique"
        ],
        "data2": "s&amp;eacute;culiers"
    },
    "someData2":
    [
        {
            "anotherData1": "ecclésiastique",
            "anotherData2": "&lt;p&gt;Le cartulaire de 1380-1381 copi&amp;eacute; au XVIIe si&amp;egrave;cle et aujourd&amp;rsquo;hui perdu&lt;strong&gt;*&lt;/strong&gt;.",
            "anotherData3":
            {
                "text1": "texte here",
                "text2": "texte here"
            }
        },
        {
            "anotherData1": "ecclésiastique",
            "anotherData2": "&lt;p&gt;Le cartulaire de 1380-1381 copi&amp;eacute; au XVIIe si&amp;egrave;cle et aujourd&amp;rsquo;hui perdu&lt;strong&gt;*&lt;/strong&gt;.",
            "anotherData3":
            {
                "text1": "texte here",
                "text2": "texte here"
            }
        }
    ]
}';

$data = json_decode($jsonData, true);

# Convert character encoding

$data = mb_convert_encoding($data, "UTF-8", "HTML-ENTITIES");

# Convert HTML entities to their corresponding characters

function html_decode(&$item){
    $item = html_entity_decode($item);
}

array_walk_recursive($data, 'html_decode');

var_dump ($data);

So I only managed to reverse the encoding: the values that were HTML entities became special characters, and the ones that were special characters became HTML entities.

But I have no idea how to end up with only special characters.

Online test: https://www.tehplayground.com/bVo3Jr5O7L9p4MXX


Solution

  • Here is the solution. I needed to

    1. convert &amp; to & so that all values use the same level of encoding;
    2. convert all HTML entities back to their corresponding characters.

    Here is the final code. Many thanks to all for your comments and suggestions.

    Full code and online test here: https://www.tehplayground.com/zythX4MUdF3ric4l

    array_walk_recursive($data, function(&$item, $key) {
        if(is_string($item)) {
            $item = str_replace("&amp;", "&", $item); // 1. Replace &amp; by &
            $item = html_entity_decode($item); // 2. Convert HTML entities to their corresponding characters
        }
    });
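As a variation on the two steps above, the decoding can be repeated until the string stops changing, which would also cover values escaped more than twice. This is only a sketch: decode_entities_fully() and its $maxPasses cap are hypothetical names introduced here, and the sample JSON is a shortened version of the data in the question.

```php
<?php
/**
 * Hypothetical helper (not from the original post): decode HTML entities
 * repeatedly until the string no longer changes, up to $maxPasses passes.
 */
function decode_entities_fully(string $value, int $maxPasses = 5): string
{
    for ($i = 0; $i < $maxPasses; $i++) {
        $decoded = html_entity_decode($value, ENT_QUOTES | ENT_HTML5, 'UTF-8');
        if ($decoded === $value) {
            break; // nothing left to decode
        }
        $value = $decoded;
    }
    return $value;
}

// Shortened sample mixing literal characters and double-encoded entities.
$jsonData = '{
    "someData1": {"data1": ["ecclésiastique"], "data2": "s&amp;eacute;culiers"},
    "someData2": [{"anotherData2": "&lt;p&gt;copi&amp;eacute; au XVIIe si&amp;egrave;cle&lt;/p&gt;"}]
}';

$data = json_decode($jsonData, true);

// Visit every leaf of the nested array and decode string values in place.
array_walk_recursive($data, function (&$item) {
    if (is_string($item)) {
        $item = decode_entities_fully($item);
    }
});

// Keep accented characters and slashes readable in the output.
$json = json_encode($data, JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
echo $json, PHP_EOL;
```

One caveat with repeated decoding: text that is meant to display a literal entity (for example a tutorial showing "&amp;lt;") would be decoded too, so this only fits data where every entity is an encoding accident.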