Sanitise UTF-8 in PHP


The function json_encode requires a valid UTF-8 string. I have a string that may be in a different encoding. I need to ignore or substitute all invalid characters to be able to convert to JSON.

  1. It should be something very simple and robust.
  2. The error is in a module for manual checking, so mojibake is fine.
  3. The code responsible for fixing encoding is in a different module. (It was broken, though.) I don’t want to duplicate responsibility.

The hexadecimal representation of an example of an invalid string: 496e76616c6964206d61726b2096
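To make the failure concrete, here is a minimal sketch of what happens when that example string is passed straight to json_encode. It assumes the mbstring extension is loaded (used here only for the validity check); the variable names are illustrative.

```php
<?php
// Decode the example bytes. The trailing byte 0x96 is a lone
// continuation byte, so the string is not valid UTF-8.
$raw_str = hex2bin('496e76616c6964206d61726b2096');

// Check validity explicitly (requires the mbstring extension).
$is_valid_utf8 = mb_check_encoding($raw_str, 'UTF-8');

// Without extra flags, json_encode returns false on malformed UTF-8
// and records the reason, which json_last_error() exposes.
$json = json_encode($raw_str);
$error = json_last_error();

var_dump($is_valid_utf8);               // bool(false)
var_dump($json);                        // bool(false)
var_dump($error === JSON_ERROR_UTF8);   // bool(true)
```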

My current solution:

$raw_str = hex2bin('496e76616c6964206d61726b2096');
$sane_str = @\iconv('UTF-8', 'UTF-8//IGNORE', $raw_str);

The three problems with my code:

  1. iconv looks a little too heavy for this.
  2. Many programmers don't like the @ error-suppression operator.
  3. iconv may discard too much: the whole string.

Any better idea?

There is a similar question, Ensuring valid UTF-8 in PHP, but I don't care about conversion.


Solution

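  • I think this is the best solution: converting from UTF-8 to UTF-8 with mb_convert_encoding replaces each invalid byte sequence with a substitute character and leaves the rest of the string intact.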

    $raw_str = hex2bin('496e76616c6964206d61726b2096');
    $sane_str = mb_convert_encoding($raw_str, 'UTF-8', 'UTF-8');
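
    As a rough check, the sketch below runs the sanitised string through json_encode. It assumes the mbstring extension is loaded. The second call is an alternative that is not part of the answer above: it assumes PHP 7.2 or later, where json_encode accepts the JSON_INVALID_UTF8_SUBSTITUTE flag and substitutes invalid bytes itself. Variable names other than $raw_str and $sane_str are illustrative.

```php
<?php
$raw_str = hex2bin('496e76616c6964206d61726b2096');

// Same call as in the answer: invalid sequences are replaced by
// mbstring's configured substitute character, valid text is kept.
$sane_str = mb_convert_encoding($raw_str, 'UTF-8', 'UTF-8');

// The result should now pass the validity check and encode cleanly.
$is_valid = mb_check_encoding($sane_str, 'UTF-8');
$json = json_encode($sane_str);

// Alternative (assumption: PHP >= 7.2): let json_encode substitute
// invalid bytes directly, with no separate sanitising step.
$json_direct = json_encode($raw_str, JSON_INVALID_UTF8_SUBSTITUTE);

var_dump($is_valid, $json, $json_direct);
```

    Neither variant needs the @ operator, and both keep the valid part of the string. Which one fits depends on whether the sanitised string is needed anywhere besides the JSON output.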