Tags: java, utf-8, character-encoding, iso-8859-1, windows-1252

How to normalize text content to UTF-8 in Java


We have a CMS containing several thousand text/html files. It turns out that users have been uploading them in various character encodings (UTF-8, UTF-8 with BOM, Windows-1252, ISO-8859-1).

When these files are read in and written to the response, our CMS's framework forces charset=UTF-8 in the response's Content-Type header.

Because of this, any non-UTF-8 content is displayed to the user as mangled characters (question marks, black diamonds, etc., wherever bytes don't translate cleanly from the "native" encoding to UTF-8). There is also no metadata attached to these documents indicating their charset; as far as I know, the only way to tell what charset they are is to open them in a text-rendering app (Firefox, Notepad++, etc.) and eyeball the content to see whether it "looks" right.

Does anyone know how to automatically/intelligently convert files of unknown encoding to UTF-8? I've read this can be accomplished with statistical modeling, but that's above my head.

Thoughts on how to best approach the problem?

Thanks


Solution

  • You can use ICU4J's CharsetDetector
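    A minimal sketch of the idea, assuming the ICU4J library (artifact com.ibm.icu:icu4j) is on the classpath; the confidence cutoff of 50 is an arbitrary threshold you would tune for your content:

    ```java
    import com.ibm.icu.text.CharsetDetector;
    import com.ibm.icu.text.CharsetMatch;

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class Utf8Normalizer {

        // Detects the encoding of a file and rewrites it as UTF-8 in place.
        static void normalizeToUtf8(Path file) throws IOException {
            byte[] raw = Files.readAllBytes(file);

            CharsetDetector detector = new CharsetDetector();
            detector.setText(raw);
            CharsetMatch match = detector.detect(); // best guess; may be null

            // Confidence is 0-100; 50 is an assumed cutoff, not an ICU default.
            if (match == null || match.getConfidence() < 50) {
                System.err.println("Skipping " + file + ": detection too uncertain");
                return;
            }

            // Decode the bytes using the detected charset.
            String text = match.getString();

            // Drop a leading BOM (U+FEFF) so it isn't re-encoded into the output.
            if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
                text = text.substring(1);
            }

            Files.write(file, text.getBytes(StandardCharsets.UTF_8));
            System.out.println(file + ": " + match.getName()
                    + " (confidence " + match.getConfidence() + ") -> UTF-8");
        }

        public static void main(String[] args) throws IOException {
            for (String arg : args) {
                normalizeToUtf8(Path.of(arg));
            }
        }
    }
    ```

    Note that detection is statistical: short files and files containing only ASCII can match several encodings equally well, which is why checking `getConfidence()` (or inspecting `detectAll()` for competing candidates) before rewriting anything is worthwhile.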