html utf-8 character-encoding non-unicode

How do I fix invalid HTML characters in pages served with different encoding?


I have a number of websites that are rendering invalid characters. The pages' meta tags specify UTF-8 encoding, but some pages contain bytes that are not valid UTF-8, probably because the files were saved with another encoding (such as ANSI). The one I'm concerned about right now is a fancy apostrophe (as in "Bob’s"; sorry if that doesn't show up correctly). The W3C validator reports the byte as "\x92", and it won't validate the file because that byte can't be interpreted as UTF-8. And, of course, if I open the file in Notepad++ and change the encoding to UTF-8, the character is replaced by a 92 in a black box.

Here's my question: what's the easiest way to fix this? Do I have to open all the pages and replace that character with a conventional apostrophe? Or is there a quick fix I could add (say, to IIS) that might override or fix the encoding issue? Or do I have to brute-force find/replace? I have hundreds of pages on these websites and I have no idea how many of them I'd have to change, so if anyone knows a way I could either circumvent this problem or fix it quickly I would appreciate it.


Solution

  • Are you serving the pages as static HTML, or is a script serving the content? If a script serves the content, it could look for any instance of \x92 and replace it with an apostrophe before sending the page. In PHP this would be a simple str_replace() call.

    If you're serving static HTML, you'll have to modify the files themselves. This can be automated, however (and probably should be if you have hundreds of files), depending on the tools you have available and the operating system you're on. Since you said you're using Notepad++, I assume you're on MS Windows (so no fun Unix commands to speed things up).

    It may be possible to write a batch script that does this, however. Command Prompt has some very simple built-in text-processing tools. If that doesn't work, it's entirely possible to write a C or C++ program to do this, provided you have a compiler on your system and a moderate knowledge of C. If you have the former but not the latter, ask and I'll whip up some source for you.
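
As an alternative to a batch file or a C program, the file-by-file fix described above could be sketched in Python, if it happens to be installed. This is only a sketch under some assumptions: the affected files were saved as Windows-1252 (what Notepad++ calls "ANSI"), files that already decode cleanly as UTF-8 should be left alone, and the names `fix_bytes` and `fix_tree` are made up for illustration. Instead of searching for the raw \x92 byte, it re-encodes the whole file, which turns \x92 into a proper UTF-8 right single quote and handles other stray Windows-1252 characters at the same time.

```python
from pathlib import Path


def fix_bytes(raw: bytes, plain_apostrophe: bool = False) -> bytes:
    """Return raw unchanged if it is valid UTF-8, else re-encode it as UTF-8."""
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8: do not touch it
    except UnicodeDecodeError:
        pass
    # Assumption: anything that is not valid UTF-8 was saved as Windows-1252.
    # Bytes that Windows-1252 leaves undefined become U+FFFD instead of failing.
    text = raw.decode("cp1252", errors="replace")
    if plain_apostrophe:
        # Optional: swap the curly apostrophe (byte 0x92) for a plain ASCII one.
        text = text.replace("\u2019", "'")
    return text.encode("utf-8")


def fix_tree(root: str, pattern: str = "*.html") -> list:
    """Rewrite every matching file under root in place; return the changed paths."""
    changed = []
    for path in Path(root).rglob(pattern):
        raw = path.read_bytes()
        fixed = fix_bytes(raw)
        if fixed != raw:
            path.write_bytes(fixed)
            changed.append(path)
    return changed
```

Calling `fix_tree` on the site's root folder would rewrite the affected files in place and return the list of files it changed, so it is worth running it on a backup copy first. A file that mixes valid UTF-8 text with stray Windows-1252 bytes would be treated as Windows-1252 throughout, which may garble the parts that were already correct, so spot-check the changed files.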