
Write to a file with a specific encoding in Java


This might be related to my previous question (on how to convert "för" to "för")

So I have a file that I create in my code. Right now I create it by the following code:

FileWriter fwOne = new FileWriter(wordIndexPath);
BufferedWriter wordIndex = new BufferedWriter(fwOne);

followed by a few

wordIndex.write(wordBuilder.toString()); //that's a StringBuilder

ending (after a while-loop) with a

wordIndex.close();

The problem is that this file is huge, and later on I want (need) to jump around in it without reading through the entire file. The seek(long pos) method of RandomAccessFile lets me do this.

Here's my problem: The characters in the file I've created seem to be encoded with UTF-8, and the only info I have when I seek is the character position I want to jump to. seek(long pos), on the other hand, jumps in bytes, so I don't end up in the right place, since a UTF-8 character can be more than one byte.

Here's my question: Can I, when I write the file, write it in ISO-8859-15 instead (where a character is a byte)? That way the seek(long pos) will get me in the right position. Or should I instead try to use an alternative to RandomAccessFile (is there an alternative where you can jump to a character-position?)


Solution

  • First, the worrying part: FileWriter and FileReader are old utility classes that use the platform default encoding of the machine they run on. Run elsewhere, the same code may produce a different file, or fail to read correctly a file that was written on another machine.

    ISO-8859-15 is a single-byte encoding. Java, however, holds text internally as Unicode so that it can combine all scripts, and a char is a UTF-16 code unit. In general a char index is not a byte index, but with ISO-8859-15 it probably works in your case. Note that a line break may be one char/byte (\n) or two (\r\n), depending on the platform.

    Re

    Personally I think UTF-8 is well established, and it is easier to use:

    byte[] bytes = string.getBytes(StandardCharsets.UTF_8);
    string = new String(bytes, StandardCharsets.UTF_8);
    

    That way all special quotes, euro, and so on will always be available.

    At least specify the encoding:

    Files.newBufferedWriter(file.toPath(), Charset.forName("ISO-8859-15"));
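
To make the single-byte approach concrete, here is a minimal sketch of writing a file in ISO-8859-15 and then jumping to a character position with RandomAccessFile. The class name SingleByteSeekDemo and the helpers writeIndex and readAt are made up for illustration and are not from the question. The sketch assumes every character is representable in ISO-8859-15 and that line breaks are written explicitly as \n, so that one char is always exactly one byte.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SingleByteSeekDemo {
    // Single-byte charset: every encodable char becomes exactly one byte.
    static final Charset LATIN9 = Charset.forName("ISO-8859-15");

    /** Writes the text with an explicit charset instead of the platform default. */
    static void writeIndex(Path path, String text) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(path, LATIN9)) {
            out.write(text); // line breaks are whatever the caller put in the text
        }
    }

    /** Seeks to a character position (equal to the byte position here) and reads 'length' chars. */
    static String readAt(Path path, long charPos, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path.toFile(), "r")) {
            raf.seek(charPos);            // byte offset == char offset for a single-byte charset
            byte[] buf = new byte[length];
            raf.readFully(buf);
            return new String(buf, LATIN9);
        }
    }

    public static void main(String[] args) throws IOException {
        String text = "f\u00f6r\nbar\n"; // "för", newline, "bar", newline: 8 chars

        // The same text needs a different number of bytes depending on the encoding.
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length); // 9 (ö takes two bytes)
        System.out.println(text.getBytes(LATIN9).length);                 // 8 (one byte per char)

        Path path = Files.createTempFile("wordIndex", ".txt");
        writeIndex(path, text);
        System.out.println(readAt(path, 4, 3)); // bar
        Files.delete(path);
    }
}
```

If the file were written as UTF-8 instead, the same seek(4) would land one byte early, in the middle of the line break, because ö occupies two bytes.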