Search code examples
phputf-8character-encoding

PHP: Convert any string to UTF-8 without knowing the original character set, or at least try


I have an application that deals with clients from all over the world, and, naturally, I want everything going into my databases to be UTF-8 encoded.

The main problem for me is that I don't know what encoding the source of any string is going to be - it could be from a text box (using <form accept-charset="utf-8"> is only useful if the user is actually submitted the form), or it could be from an uploaded text file, so I really have no control over the input.

What I need is a function or class that makes sure the stuff going into my database is, as far as is possible, UTF-8 encoded. I've tried iconv(mb_detect_encoding($text), "UTF-8", $text); but that has problems (if the input is 'fiancée' it returns 'fianc'). I've tried a lot of things =/

For file uploads, I like the idea of asking the end user to specify the encoding they use, and show them previews of what the output will look like, but this doesn't help against nasty hackers (in fact, it could make their life a little easier).

I've read the other Stack Overflow questions on the subject, but they seem to all have subtle differences like "I need to parse RSS feeds" or "I scrape data from websites" (or, indeed, "You can't").

But there must be something that at least has a good try!


Solution

  • What you're asking for is extremely hard. If possible, getting the user to specify the encoding is the best. Preventing an attack shouldn't be much easier or harder that way.

    However, you could try doing this:

    iconv(mb_detect_encoding($text, mb_detect_order(), true), "UTF-8", $text);
    

    Setting it to strict might help you get a better result.