I'm trying to use a series of string.replaceAlls to swap all the UTF-8 special characters in a text file with ASCII & HTML encoding. Along the way I've hit a particularly stubborn one: \uAC8B, the UTF-8 middot.
Here's the line that cuts out the character, half the time:
string_out = string_out.replaceAll("¬ï", "&middot;");
("¬ï" is how a UTF-8 · appears as extended ASCII. Before stumbling on this line, I'd tried "\uAC8B" and many other encodings without success.)
The line removes the UTF-8 middot rather than replacing it, and it does so only half the time. The other half of the time it misses the character and leaves it unchanged. If I make multiple copies of the line or move other lines around it, it doesn't even do that.
This feels like a multithreading issue, but I'm not aware of any multithreading going on. It's just a block of replaceAll calls in an included .jsp file being run from another .jsp.
What could cause this race-condition-like behavior?
U+AC8B is not a dot, it's a Korean Hangul syllable. Did you mean U+00B7?
Java strings are always UTF-16 Unicode. UTF-8 is a way of representing Unicode characters in a file; it is not the way Java strings are stored in memory.
Pay attention to the encoding used to read the input and write the output files. They should be UTF-8, but once the file contents have been read into a Java string, the text is no longer UTF-8 but UTF-16.
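To illustrate the point about encodings, here is a minimal sketch (the class and method names are made up for this example) that decodes the two bytes a UTF-8 file stores for the middle dot. Decoded as UTF-8 they become a single char, U+00B7. Decoded with a single-byte charset such as ISO-8859-1 they become two unrelated chars, which is probably where the two-character garbage in the question comes from.

```java
import java.nio.charset.StandardCharsets;

public class MiddotDecoding {
    // The middle dot U+00B7 as stored in a UTF-8 file: two bytes, 0xC2 0xB7.
    static final byte[] UTF8_MIDDOT = {(byte) 0xC2, (byte) 0xB7};

    // Correct: decode the bytes with the charset they were written in.
    static String decodeAsUtf8() {
        return new String(UTF8_MIDDOT, StandardCharsets.UTF_8);
    }

    // Wrong charset: each byte turns into its own char.
    static String decodeAsLatin1() {
        return new String(UTF8_MIDDOT, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String right = decodeAsUtf8();
        String wrong = decodeAsLatin1();
        System.out.println(right.length());                       // prints 1
        System.out.println(Integer.toHexString(right.charAt(0))); // prints b7
        System.out.println(wrong.length());                       // prints 2
    }
}
```

So if the reader that loads the file is not told it is UTF-8, the string you end up searching never contains U+00B7 in the first place.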
I think your best bet is to use the correct Unicode escape, \u00B7, rather than trying to represent raw UTF-8 bytes as ASCII.
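Something along these lines should work, assuming the file has already been read as UTF-8 and that the HTML entity you want is &middot;. The class and method names are just placeholders. Note that it uses String.replace rather than replaceAll, since no regex is needed here and replace treats both arguments literally.

```java
public class MiddotReplace {
    // Hypothetical helper: swaps every middle dot (U+00B7) for its HTML entity.
    static String escapeMiddot(String text) {
        // "\u00B7" is one UTF-16 char in the Java string, whatever
        // encoding the source file on disk happened to use.
        return text.replace("\u00B7", "&middot;");
    }

    public static void main(String[] args) {
        System.out.println(escapeMiddot("a \u00B7 b")); // prints "a &middot; b"
    }
}
```

In your code that would be string_out = string_out.replace("\u00B7", "&middot;");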