Java: Search in a wrongly encoded String without modifying it


I have to find a user-defined String in a document (using Java) that is stored in a database as a BLOB. When I search for a String with special characters (umlauts such as ä, ö, ü), the search fails: it does not return any positions at all. I am not allowed to convert the document's content into UTF-8, which would fix this problem but create a new, even bigger one.

Some additional information: the document's content is returned as a String decoded as "ISO-8859-1" (Latin-1). Here is an example of what such a String could look like:

Die Erkenntnis, daÃ der KÃ¼nstler Schutz braucht, ...

This is what it should look like:

Die Erkenntnis, daß der Künstler Schutz braucht, ...

If I search for Künstler, nothing is found, because the search looks for ü while the String contains Ã¼ instead.

Is it possible to convert Künstler into KÃ¼nstler so that I can search for the wrongly encoded version instead?

Note: We are using the Hibernate framework for database access. The original getter for the document's content returns a byte[]. The String is then obtained by calling

new String(getContent(), "ISO-8859-1")

The problem is that I cannot change this to UTF-8, because that would break the rest of our application, which is built on a third-party application that delivers data this way.


Solution

  • Okay, it looks like I've found a way to mess up the encoding on purpose:

    new String("Künstler".getBytes("UTF-8"), "ISO-8859-1")
    

    Getting the bytes of the String Künstler in UTF-8 and then creating a new String from them, while telling Java that they are Latin-1, yields KÃ¼nstler. It's a hell of a hack, but it seems to work well.
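To make the trick easier to reuse, here is a minimal sketch of how it could be wrapped in a helper and used for the actual search. The class name MojibakeSearch and the methods toMisdecoded and indexOfMisdecoded are made up for this illustration, and the sketch assumes the stored bytes really are UTF-8 that gets decoded as ISO-8859-1, as described in the question. The umlauts are written as Unicode escapes so that the example does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeSearch {

    // Hypothetical helper: turns a correctly typed search term into the form
    // it takes when its UTF-8 bytes are decoded as ISO-8859-1.
    static String toMisdecoded(String term) {
        return new String(term.getBytes(StandardCharsets.UTF_8),
                          StandardCharsets.ISO_8859_1);
    }

    // Hypothetical helper: searches the unmodified (wrongly decoded) content.
    // The returned index refers to positions in that unmodified content.
    static int indexOfMisdecoded(String content, String term) {
        return content.indexOf(toMisdecoded(term));
    }

    public static void main(String[] args) {
        // Simulate the situation from the question: UTF-8 bytes from the BLOB,
        // decoded as Latin-1 the way the existing getter does it.
        byte[] stored = "Die Erkenntnis, da\u00df der K\u00fcnstler Schutz braucht"
                .getBytes(StandardCharsets.UTF_8);
        String content = new String(stored, StandardCharsets.ISO_8859_1);

        String term = "K\u00fcnstler";

        // The naive search does not find the term in the wrongly decoded content.
        System.out.println(content.indexOf(term)); // prints -1

        // Searching for the deliberately misdecoded term does find it.
        System.out.println(indexOfMisdecoded(content, term));
    }
}
```

Note that every non-ASCII character becomes two characters in the misdecoded form, so the length of a match is toMisdecoded(term).length(), not term.length(). Using the StandardCharsets constants instead of the charset names as Strings also avoids having to handle UnsupportedEncodingException.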